SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents

arXiv CS, Tuesday 02 June 2026, 04:00 UTC
By Yujiong Shen, Yajie Yang, Zhiheng Xi, Binze Hu, Huayu Sha, Jiazheng Zhang, Qiyuan Peng, Junlin Shang, Jixuan Huang, Yutao Fan, Jingqi Tong, Shihan Dou, Ming Zhang, Lei Bai, Zhenfei Yin, Tao Gui, Xingjun Ma, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang

Key Points

arXiv:2602.12984v2

Abstract: Scientific reasoning inherently demands integrating sophisticated toolkits to navigate domain-specific knowledge. Yet, current benchmarks largely overlook agents' ability to orchestrate tools for such rigorous workflows. To bridge this gap, we introduce SciAgentGym, a scalable interactive environment featuring 1,780 domain-specific tools across four natural science disciplines, supported by a robust execution infrastructure. Complementing this, we present SciAgentBench, a tiered evaluation suite designed to stress-test agentic capabilities from elementary actions to long-horizon workflows. Our evaluation identifies a critical bottleneck: state-of-the-art models still struggle with complex scientific tool-use, and their performance degrades substantially as interaction horizons extend. To address this, we propose SciForge, a data synthesis method that models the tool action space as a dependency graph to generate logic-aware training trajectories. By fine-tuning on these trajectories, our SciAgent-8B outperforms the significantly larger Qwen3-VL-235B-Instruct while exhibiting positive cross-domain transfer of scientific tool-use capabilities. These results underscore the promising potential of next-generation autonomous scientific agents.
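The abstract describes SciForge only at a high level, so the following is a minimal sketch of the general idea of treating tools as a dependency graph and sampling call orders that respect it. It is not the authors' implementation. The tool names, the `TOOL_DEPS` mapping, and the helper functions `required_tools` and `sample_trajectory` are all hypothetical and chosen purely for illustration.

```python
import random
from graphlib import TopologicalSorter

# Hypothetical tool dependency graph (an assumption, not from the paper):
# each tool maps to the set of tools whose outputs it consumes.
TOOL_DEPS = {
    "parse_formula": set(),
    "molar_mass": {"parse_formula"},
    "moles_from_mass": {"molar_mass"},
    "ideal_gas_volume": {"moles_from_mass"},
    "lookup_density": set(),
}


def required_tools(goal, deps):
    """Return the goal tool plus every tool it transitively depends on."""
    needed, stack = set(), [goal]
    while stack:
        tool = stack.pop()
        if tool not in needed:
            needed.add(tool)
            stack.extend(deps[tool])
    return needed


def sample_trajectory(goal, deps, rng):
    """Sample one ordering of tool calls that respects dependencies and ends at goal."""
    needed = required_tools(goal, deps)
    # Restrict the graph to the tools the goal actually needs.
    sorter = TopologicalSorter({tool: deps[tool] & needed for tool in needed})
    sorter.prepare()
    order = []
    while sorter.is_active():
        # Tools that are ready at the same time are independent, so any
        # shuffled order among them is still a valid trajectory.
        ready = sorted(sorter.get_ready())
        rng.shuffle(ready)
        for tool in ready:
            order.append(tool)
            sorter.done(tool)
    return order


trajectory = sample_trajectory("ideal_gas_volume", TOOL_DEPS, random.Random(0))
print(trajectory)
```

In this sketch, tools unrelated to the goal (here `lookup_density`) are excluded, and the randomness only affects the relative order of tools that do not depend on each other. A real synthesis pipeline would also need to generate arguments and check execution results, which this example does not attempt.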


Originally published by arXiv CS.

