

Beyond Accuracy: Behavioral Dynamics of Agentic Multi-Hunk Repair

arXiv CS, Tuesday 09 June 2026, 04:00 UTC. By Noor Nashid, Daniel Ding, Keheliya Gallaba, Ahmed E. Hassan, Ali Mesbah

Key Points

arXiv:2511.11012v2

Abstract: Automated program repair has traditionally focused on single-hunk defects, overlooking multi-hunk bugs that are prevalent in real-world systems. Repairing these bugs requires coordinated edits across multiple, disjoint code regions, which is substantially harder. We present the first systematic study of LLM-driven coding agents (Claude Code, Codex, Gemini-cli, and Qwen Code) on this task. We evaluate these four state-of-the-art agents on 404 multi-hunk bugs from the PolyHunk dataset, yielding 1,616 repair trajectories for large-scale behavioral analysis. We employ fine-grained metrics to assess localization, repair accuracy, regression behavior, and operational dynamics across agents.

We find that localization capability varies substantially, with Codex achieving the highest success rate (75.3%) and Qwen Code the lowest (40.4%). Repair accuracy also differs widely, ranging from 26.98% (Qwen Code) to 92.82% (Claude Code), and consistently declines with increasing bug dispersion and complexity (hunk divergence and spatial proximity). High-performing agents (Claude Code and Codex) demonstrate superior semantic consistency, achieving positive average regression reduction, whereas lower-performing agents often introduce new test failures. Notably, agents do not fail fast: failed repairs consume 33%-440% more input tokens and take 35%-330% longer to execute than successful ones.

Additionally, we developed Maple to provide agents with repository-level context. Empirical results show that Maple improves the repair accuracy of Gemini-cli by ~21% through enhanced localization. By combining fine-grained metrics with trajectory-level analysis, this study moves beyond accuracy to explain how coding agents localize, reason, and act during multi-hunk repair. Our findings underscore the impact of bug divergence and spatial proximity on multi-hunk repair success for coding agents.
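As a reading aid, the arithmetic behind two of the reported figures can be sketched in a few lines of Python. The trajectory count follows directly from the abstract (404 bugs, four agents). The overhead calculation is only an assumption about how a "percent more input tokens" figure might be derived: the abstract does not say whether means, medians, or another aggregate were used, and the token counts below are invented placeholders, not data from the study.

```python
from statistics import mean

# Agents and bug count as named in the abstract.
AGENTS = ["Claude Code", "Codex", "Gemini-cli", "Qwen Code"]
NUM_BUGS = 404


def trajectory_count(num_bugs: int, agents: list[str]) -> int:
    """One repair trajectory per (bug, agent) pair."""
    return num_bugs * len(agents)


def relative_overhead(failed: list[float], succeeded: list[float]) -> float:
    """Percent by which the mean cost of failed runs exceeds that of successful runs.

    Assumption: a mean-based comparison; the study may aggregate differently.
    """
    baseline = mean(succeeded)
    return (mean(failed) - baseline) / baseline * 100.0


# Hypothetical input-token counts for one agent (illustrative only).
succeeded_tokens = [100_000, 120_000, 80_000]
failed_tokens = [150_000, 130_000, 170_000]

print(trajectory_count(NUM_BUGS, AGENTS))                     # 1616
print(relative_overhead(failed_tokens, succeeded_tokens))     # 50.0
```

With these placeholder numbers, failed runs cost 50% more input tokens than successful ones; the study reports overheads between 33% and 440% depending on the agent.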


Originally published by arXiv CS.

