Home Science Beyond Accuracy: Behavioral Dynamics of Agentic Multi-Hunk Repair
Science

Beyond Accuracy: Behavioral Dynamics of Agentic Multi-Hunk Repair

Key Points

arXiv:2511.11012v2 Announce Type: replace Abstract: Automated program repair has traditionally focused on single-hunk defects, overlooking multi-hunk bugs that are prevalent in real-world systems. Repairing these bugs requires coordinated edits across multiple, disjoint code regions, posing substantially greater challenges. We present the first systematic study of LLM-driven coding agents (Claude Code, Codex, Gemini-cli, and Qwen Code) on this task.

arXiv:2511.11012v2 Announce Type: replace Abstract: Automated program repair has traditionally focused on single-hunk defects, overlooking multi-hunk bugs that are prevalent in real-world systems. Repairing these bugs requires coordinated edits across multiple, disjoint code regions, posing substantially greater challenges. We present the first systematic study of LLM-driven coding agents (Claude Code, Codex, Gemini-cli, and Qwen Code) on this task. We evaluate these four state-of-the-art agents on 404 multi-hunk bugs from the PolyHunk dataset, yielding 1,616 repair trajectories for large-scale behavioral analysis. We employ fine-grained metrics to assess localization, repair accuracy, regression behavior, and operational dynamics across agents. We find that localization capability varies substantially, with Codex achieving the highest success rate (75.3%) and Qwen Code the lowest (40.4%). Repair accuracy also differs widely, ranging from 26.98% (Qwen Code) to 92.82% (Claude Code), and consistently declines with increasing bug dispersion and complexity (hunk divergence and spatial proximity). High-performing agents (Claude Code and Codex) demonstrate superior semantic consistency, achieving positive average regression reduction, whereas lower-performing agents often introduce new test failures. Notably, agents do not fail fast; failed repairs consume substantially more resources (33%-440% more input tokens) and require longer execution time (35%-330%). Additionally, we developed Maple to provide agents with repository-level context. Empirical results show that Maple improves repair accuracy of Gemini-cli by ~21% through enhanced localization. By analyzing fine-grained metrics and trajectory-level analysis, this study moves beyond accuracy to explain how coding agents localize, reason, and act during multi-hunk repair. Our findings underscore the impact of bug divergence and spatial proximity on multi-hunk repair success for coding agents.
LLM (ORG) Claude (PERSON) Codex (ORG) Gemini (ORG) Qwen Code (PERSON) Maple (ORG)
Originally published by arXiv CS Read original →