When Users Are Happy but Agents Are Wrong: Multi-Dimensional Evaluation of Tool-Augmented Dialogue

arXiv CS, Tuesday 09 June 2026, 04:00 UTC. By Tanya Shourya, Yingfan Wang, Zhaoyi Joey Hou, Shamik Roy, Vinayshekhar Bannihatti Kumar, Rashmi Gangadharaiah

Key Points

arXiv:2510.19186v2 (announce type: replace). Abstract: Evaluating conversational AI systems that use external tools is challenging, as errors can arise from complex interactions among the user, the agent, and the tools. While existing evaluation methods assess either user satisfaction or agents' tool-calling capabilities, they fail to capture critical errors in multi-turn tool-augmented dialogues, such as when agents misinterpret tool results yet appear satisfactory to users. We introduce TRACE, a benchmark of systematically synthesized tool-augmented conversations covering diverse error cases. Evaluation with state-of-the-art conversation evaluation frameworks reveals that all approaches remain far from ideal performance, demonstrating the fundamental difficulty of this benchmark.

Originally published by arXiv CS.
