Asuka-Bench: Benchmarking Code Agents on Underspecified User Intent and Multi-Round Refinement

arXiv CS, Friday 05 June 2026, 04:00 UTC. By Xin Wang, Liangtai Sun, Yaoming Zhu, Shuang Zhou, Jiaxing Liu, Fengjiao Chen, Lin Qiu, Xuezhi Cao, Xunliang Cai, Licheng Zhang, Zhendong Mao

arXiv:2606.05920v1

Abstract: Existing code-generation benchmarks score a single mapping from a complete prompt to a one-shot output. However, real web development is different. Users seldom write a full spec at the start; many requirements only become clear once they look at an intermediate result and react to it. We present Asuka-Bench, a benchmark that pairs underspecified user intent with multi-round refinement, grounded in browser-rendered behavior. Each task is resolved through a closed loop: a Code Agent generates a web project, a UI Agent executes test cases on the deployed site, and a User LLM turns evaluation outcomes into natural-language feedback for the next round. The benchmark comprises 50 web tasks with 784 evaluation criteria and 2402 expected outcomes. We benchmark 8 LLMs across 2 agent frameworks. The results separate models clearly: weighted Task Pass Rate varies by 38 percentage points and models also differ substantially in their ability to repair from feedback. Asuka-Bench is also far from saturated: even the strongest model completes only 52% of projects after three rounds.
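
The closed loop described in the abstract can be pictured with a short sketch. This is only an illustration under stated assumptions, not the benchmark's actual harness: the function name `run_refinement_loop`, the three callables, the `RoundResult` record, and the early stop once every criterion passes are all hypothetical. The abstract confirms only the three roles and the three-round budget.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RoundResult:
    round_index: int  # 1-based round number
    passed: int       # criteria that passed in this round
    total: int        # criteria evaluated in this round


def run_refinement_loop(
    intent: str,
    code_agent: Callable[[str, str], str],   # (intent, feedback) -> project (hypothetical)
    ui_agent: Callable[[str], List[bool]],   # project -> per-criterion outcomes (hypothetical)
    user_llm: Callable[[List[bool]], str],   # outcomes -> natural-language feedback (hypothetical)
    max_rounds: int = 3,
) -> List[RoundResult]:
    """Generate, evaluate, and feed back for up to max_rounds rounds."""
    history: List[RoundResult] = []
    feedback = ""  # the first round sees only the underspecified intent
    for round_index in range(1, max_rounds + 1):
        project = code_agent(intent, feedback)
        outcomes = ui_agent(project)
        history.append(RoundResult(round_index, sum(outcomes), len(outcomes)))
        if all(outcomes):
            break  # assumption: stop early once every criterion passes
        feedback = user_llm(outcomes)
    return history


def demo() -> List[RoundResult]:
    """Toy stand-ins: the stub code agent fixes one more criterion per feedback."""
    state = {"version": 1}

    def code_agent(intent: str, feedback: str) -> str:
        if feedback:
            state["version"] += 1
        return f"project-v{state['version']}"

    def ui_agent(project: str) -> List[bool]:
        version = int(project.rsplit("v", 1)[1])
        return [k < version for k in range(3)]  # three toy criteria

    def user_llm(outcomes: List[bool]) -> str:
        failed = [k for k, ok in enumerate(outcomes) if not ok]
        return f"criteria {failed} still fail"

    return run_refinement_loop("build a todo app", code_agent, ui_agent, user_llm)
```

Calling `demo()` runs the loop with the toy stubs and returns one `RoundResult` per round. In a real setup the three callables would wrap an agent framework, a browser-driving tester, and a language model.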

Originally published by arXiv CS.
