Home Knowledge Base BFCL

BFCL

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

MAVEN: Improving Generalization in Agentic Tool Calling

Announce Type: new Abstract: Generalization across agentic tool-calling environments remains a central challenge for reliable agentic reasoning systems. Although large language models achieve strong results on individual benchmarks, their ability to compose reasoning strategies, preserve intermediate states, and coordinate tools across domains remains underexplored. We present MAVEN (Modular Agentic Verification and Execution Network), a lightweight symbolic reasoning scaffold for structured...

arXiv CS 9d ago

Scaling Agentic Capabilities via Grounded Interaction Synthesis

arXiv:2606.02001v1 Announce Type: new Abstract: General agentic intelligence hinges on the ability to interact with diverse real-world tools to complete complex tasks, a capability fundamentally tied to the quality of interaction data. To bypass the prohibitive costs of human annotation, prevailing paradigms depend entirely on Large Language Models (LLMs) to scale the synthesis of agentic environments and tasks. However, such unconstrained generation often degenerates into biased random...

arXiv CS 8d ago

Synthesize and Reward -- Reinforcement Learning for Multi-Step Tool Use in Live Environments

Announce Type: new Abstract: Training LLMs to orchestrate multi-step tool calls is held back by three coupled obstacles: realistic stateful execution environments are costly to build, synthetic training queries are often detached from the server's actual state (so the generated tool calls fail to execute), and recall-based RL rewards incentivize verbose tool-calling patterns. We present PROVE (Programmatic Rewards On Verified Environments), a framework with three contributions: (1) a library...

arXiv CS 7d ago

Synthesize and Reward -- Reinforcement Learning for Multi-Step Tool Use in Live Environments

Announce Type: replace Abstract: Training LLMs to orchestrate multi-step tool calls is held back by three coupled obstacles: realistic stateful execution environments are costly to build, synthetic training queries are often detached from the server's actual state (so the generated tool calls fail to execute), and recall-based RL rewards incentivize verbose tool-calling patterns. We present PROVE (Programmatic Rewards On Verified Environments), a framework with three contributions: (1) a...

arXiv CS 6d ago

Capability-Aligned Hierarchical Learning for Tool-Augmented LLMs

arXiv:2606.09371v1 Announce Type: new Abstract: Tool learning enables LLMs to invoke external tools to accomplish tasks. Prior studies have demonstrated the effectiveness of a hierarchical structure: a high-level policy handles global planning and decomposes tasks into manageable sub-tasks, and a low-level policy focuses on invoking tools to solve these sub-tasks. However, these works typically optimize the high-level and low-level policies separately, leading to planner-executor...

arXiv CS 1d ago

R2IF: Aligning Reasoning with Decisions via Composite Rewards for Interpretable LLM Function Calling

Announce Type: replace Abstract: Function calling empowers large language models (LLMs) to interface with external tools, yet existing RL-based approaches suffer from misalignment between reasoning processes and tool-call decisions. We propose R2IF, a reasoning-aware RL framework for interpretable function calling, adopting a composite reward integrating format/correctness constraints, Chain-of-Thought Effectiveness Reward (CER), and Specification-Modification-Value (SMV) reward, optimized...

arXiv CS 7d ago

The Cold-Start Safety Gap in LLM Agents

arXiv:2606.07867v1 Announce Type: new Abstract: Are tool-calling LLM agents equally safe throughout a conversation? We discover they are not: agents are most vulnerable at the very start of a session and become substantially safer after a few regular agentic tasks -- a phenomenon we term the cold-start safety gap. To study this systematically, we introduce Safety Over Depth for Agents (SODA), a benchmark that controls how many regular agentic tasks the agent completes before encountering a...

arXiv CS 1d ago

How Many Tools Should an LLM Agent See? A Chance-Corrected Answer

arXiv:2605.24660v2 Announce Type: replace Abstract: Before an LLM agent can use a tool, a retrieval system must decide which candidate tools to show to the agent. How long should that shortlist be? Show too many tools and the model struggles to choose.

arXiv CS 1d ago

ComplexConstraints and Beyond: Expert Rubrics for RLVR

arXiv:2606.09118v1 Announce Type: new Abstract: As LLM capabilities advance rapidly, the evaluation methods used to assess them increasingly lag behind. Traditional benchmarks relied on programmatic verification of narrow, surface-level constraints, but real-world instruction following and agentic tasks demand assessment of nuanced, context-dependent behaviors that resist simple scripted checks. We present a systematic analysis of expert-curated rubric-based evaluation as an alternative...

arXiv CS 1d ago