When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents

arXiv CS Friday 05 June 2026, 04:00 UTC By Dongsheng Zhu, Xuchen Ma, Yucheng Shen, Xiang Li, Yukun Zhao, Shuaiqiang Wang, Lingyong Yan, Dawei Yin

Key Points

arXiv:2606.05806v1

Abstract: Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized "happy paths", largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial-and-error, ToolMaze adopts a two-dimensional design: DAG-based topological complexity and a $2 \times 2$ taxonomy of tool perturbations (explicit/implicit, transient/permanent). Evaluations show that perturbations degrade performance across nearly all models, with the sharpest drops under implicit semantic failures. Driven by systemic over-trust in corrupted outputs, Perturbation Recovery Rate (PRR) plummets by around 37% in these scenarios, while complex topologies trap agents in futile trial-and-error loops. Crucially, agentic fault-tolerance improves with model scale $3.66\times$ slower than basic task execution, highlighting dynamic replanning as a distinct bottleneck unaddressed by model scaling or prompting. Data and code are available at https://github.com/Zhudongsheng75/ToolMaze.
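The $2 \times 2$ perturbation taxonomy can be made concrete with a small Python sketch. Nothing here comes from the ToolMaze code: `PerturbedTool`, `retry_on_error`, `failures_before_recovery`, and the corrupted return value `-1` are hypothetical names and choices made for illustration, and the final ratio is a toy success fraction, not the paper's PRR, whose exact definition is not given in the abstract. The sketch only shows why an agent that reacts to raised errors alone may handle explicit transient failures yet silently accept implicit ones.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product
from typing import Callable, Optional


class Visibility(Enum):
    EXPLICIT = "explicit"  # failure is signalled (here: an exception)
    IMPLICIT = "implicit"  # call "succeeds" but the output is corrupted


class Duration(Enum):
    TRANSIENT = "transient"  # tool works again after some failed calls
    PERMANENT = "permanent"  # tool never recovers


@dataclass
class PerturbedTool:
    """Hypothetical wrapper that injects one of the four perturbation types."""
    tool: Callable[[int], int]
    visibility: Visibility
    duration: Duration
    failures_before_recovery: int = 1  # assumption; only matters if TRANSIENT
    calls: int = 0

    def __call__(self, x: int) -> int:
        self.calls += 1
        recovered = (
            self.duration is Duration.TRANSIENT
            and self.calls > self.failures_before_recovery
        )
        if recovered:
            return self.tool(x)
        if self.visibility is Visibility.EXPLICIT:
            raise RuntimeError("tool unavailable")
        return -1  # silently wrong value, no error raised


def double(x: int) -> int:
    return 2 * x


def retry_on_error(tool: Callable[[int], int], x: int,
                   max_calls: int = 3) -> Optional[int]:
    """Naive policy: retry only when the tool raises, trust any returned value."""
    for _ in range(max_calls):
        try:
            return tool(x)
        except RuntimeError:
            continue
    return None


# Run the naive policy against all four cells of the taxonomy.
outcomes = {}
for vis, dur in product(Visibility, Duration):
    perturbed = PerturbedTool(double, vis, dur)
    outcomes[(vis.value, dur.value)] = retry_on_error(perturbed, 21)

# Toy success fraction over the four cells (not the paper's PRR).
toy_recovery_rate = sum(v == 42 for v in outcomes.values()) / len(outcomes)
```

In this toy setup the retry policy recovers only in the explicit/transient cell. It gives up in the explicit/permanent cell, where an alternative tool path would be needed, and returns the corrupted value in both implicit cells because nothing prompts it to doubt the output.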


Originally published by arXiv CS

