Reinforcement Learning for Search Agents

Related Articles from SNS

Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

arXiv:2606.02373v1 Announce Type: new Abstract: Search agents are often trained as policies over growing transcripts: the model must decide how to search while also remembering what it has seen, which evidence is useful, which constraints remain open, and which claims have actually been checked. We argue that this formulation puts too much routine state management inside the policy: reinforcement learning is forced to optimize both semantic search decisions and recoverable bookkeeping that...

arXiv CS 8d ago

SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search

arXiv:2605.29796v2 Announce Type: replace Abstract: Agentic search enables LLMs to solve complex multi-hop questions through iterative reasoning and external search. Despite the effectiveness, these systems often suffer from a critical limitation in practice: agents fail to recognize their own knowledge boundaries, blindly triggering searches when internal knowledge suffices and failing to terminate search even when adequate evidence has been collected. The lack of self-awareness leads to...

arXiv CS 9d ago

StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

arXiv:2604.18401v3 Announce Type: replace Abstract: Agentic reinforcement learning (RL) is emerging as a critical post-training paradigm for improving LLM agent capabilities. Existing RL algorithms for LLMs largely follow the token-centric paradigm as in RLHF and RLVR, where tokens serve as the basic units for modeling and optimization. However, this paradigm introduces a granularity mismatch in agentic RL, as it optimizes token-level predictions while LLM agents make step-level decisions...

arXiv CS 2d ago

EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning

arXiv:2606.03108v1 Announce Type: new Abstract: Autonomous LLM training is often framed as recipe search, which leaves the training harness largely static. This limitation sharpens in agentic RL, where shifting bottlenecks and scalar rewards mask diverse failure modes. We introduce EvoTrainer, an autonomous training framework that co-evolves LLM policies and training-side harnesses through empirical feedback: it diagnoses rollout-level evidence, revises diagnostics, backtests interventions,...

arXiv CS 7d ago

Large Language Model Guided Incentive Aware Reward Design for Cooperative Multi-Agent Reinforcement Learning

arXiv:2603.24324v4 Announce Type: replace Abstract: Designing effective auxiliary rewards for cooperative multi-agent systems remains challenging, as misaligned incentives can induce suboptimal coordination, particularly when sparse task rewards provide insufficient grounding for coordinated behavior. This study introduces an autonomous reward design framework that uses large language models (LLMs) to synthesize executable reward programs from environment instrumentation. The procedure...

arXiv CS 8d ago

Reinforcement Learning-Enabled Agent for Transmitter Optimization in Digital-Analog Radio-over-Fiber Fronthaul

arXiv:2606.04840v1 Announce Type: new Abstract: Digital-analog radio-over-fiber (DA-RoF) has emerged as a promising fronthaul solution that combines the high spectral efficiency of analog transmission with the robustness of digital transmission. However, the performance of DA-RoF critically depends on several tightly coupled parameters, including the rounding factor (RF), scaling factor (SF), geometric shaping (GS) factor, and pre-equalization taps coefficients, which jointly affect...

arXiv Physics 6d ago

CAPF: Guiding Search-Agent Rollouts with Credit-Attenuated Privileged Feedback

arXiv:2606.01830v1 Announce Type: new Abstract: Recent LLM search agents use reinforcement learning with verifiable rewards (RLVR) to learn search-augmented reasoning from outcome rewards. On hard problems, these agents rarely sample end-to-end successful rollouts, leaving outcome-only RLVR with few positive-reward trajectories.

arXiv CS 8d ago

LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

Announce Type: new Abstract: Long-context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information in extensive distracting content. Reinforcement learning with verifiable rewards (RLVR) has shown promise for this task, yet existing methods are limited by low-confusability distractors and sparse, outcome-only reward signals that cannot supervise intermediate reasoning steps. To address these issues, we introduce...

arXiv CS 9d ago

Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents

Announce Type: replace Abstract: Multimodal deep search requires an agent to solve open-world problems by chaining search, tool use, and visual reasoning over evolving textual and visual context. Two bottlenecks limit current systems. First, existing tool-use harnesses treat images returned by search, browsing, or transformation as transient outputs, so intermediate visual evidence cannot be re-consumed by later tools.

arXiv CS 2d ago