Agentic Monte Carlo
No mentions found
This entity hasn't been tracked yet, or Iris is still building its knowledge base.
Related Articles from SNS
Agentic Monte Carlo: Simulating Reinforcement Learning for Black-Box Agents
arXiv:2606.05296v1 Announce Type: new Abstract: LLM agents operate in two distinct regimes: open-weight agents amenable to reinforcement learning (RL) and black-box agents whose behaviour must be controlled purely at test time. Although black-box agents are often backed by state-of-the-art proprietary LLMs, API-only access precludes parameter-level optimization, rendering most RL methods inapplicable. To address this limitation, we turn to a known equivalence between RL and Bayesian inference.
ChatSOP: An SOP-Guided MCTS Planning Framework for Controllable LLM Dialogue Agents
arXiv:2407.03884v4 Announce Type: replace Abstract: Dialogue agents powered by Large Language Models (LLMs) show superior performance in various tasks. Despite the better user understanding and human-like responses, their **lack of controllability** remains a key challenge, often leading to unfocused conversations or task failure. To address this, we introduce Standard Operating Procedure (SOP) to regulate dialogue flow.
Partisan voter model on complex networks: Dynamics of local ordering
arXiv:2606.05062v1 Announce Type: new Abstract: We investigate the processes of local ordering for the partisan voter model on complex networks. In this model agents hold a binary opinion and a fixed preference that biases updates toward alignment with their preferred state. We first study the dynamics on uncorrelated random networks and derive a pair approximation that resolves the densities of links between different classes of agents.
Industrializing Prediction-Powered Inference: The GLIDE Library for Reliable GenAI and Agentic Systems Evaluation
arXiv:2605.31278v1 Announce Type: new Abstract: Reliable evaluation of agentic systems requires unbiased estimates with valid uncertainty, but standard practice navigates between costly human annotation and biased LLM-as-judge proxies. Prediction-powered inference (PPI) combines both into debiased estimates with valid confidence intervals, yet its various methods remain scattered across papers under partial implementations. We introduce GLIDE, an open-source Python library that unifies...
Industrializing Prediction-Powered Inference: The GLIDE Library for Reliable GenAI and Agentic Systems Evaluation
arXiv:2605.31278v2 Announce Type: replace Abstract: Reliable evaluation of agentic systems requires unbiased estimates with valid uncertainty, but standard practice navigates between costly human annotation and biased LLM-as-judge proxies. Prediction-powered inference (PPI) combines both into debiased estimates with valid confidence intervals, yet its various methods remain scattered across papers under partial implementations. We introduce GLIDE, an open-source Python library that unifies...
Uncertainty-Aware LLM-Guided Policy Shaping for Sparse-Reward Reinforcement Learning
arXiv:2606.06673v1 Announce Type: new Abstract: Sparse rewards and heterogeneous task sequences remain persistent challenges in Reinforcement Learning (RL), often resulting in slow convergence, weak generalization, and inefficient exploration. We propose Uncertainty-Aware LLM-Guided Policy Shaping (ULPS), a novel framework that integrates a calibrated Large Language Model (LLM) into the RL training loop to provide structured, uncertainty-modulated behavioral guidance. ULPS employs an...
Causal Agent Replay: Counterfactual Attribution for LLM-Agent Failures
Announce Type: new Abstract: When an LLM agent fails -- issues a refund it should not have, calls the wrong tool, leaks data -- existing tooling answers what happened (observability) or whether it passed (evaluation), but not which step caused the failure. The obvious heuristics are wrong: the step that executes the harmful action is usually not the step that decided on it, and LLM-judge attribution is correlational and unreliable (state-of-the-art step-level accuracy on the Who&When...
HAVE: Host Active Verification Engine for Closing the Contextual Reality Gap in Security Digital Twins
arXiv:2606.06968v1 Announce Type: new Abstract: Security Digital Twins (SDTs) provide continuously updated virtual replicas of infrastructure for threat simulation, yet they rely on theoretical CVSS scores to assign lateral-movement probabilities -- creating the Contextual Reality Gap: risk is overestimated where unacknowledged mitigations neutralize exploits, and drastically underestimated where logic flaws bypass all memory-safety defenses. We present the Host Active Verification Engine...
RedEdit: Agentic Red-Teaming of Image Safety Classifiers via MCTS-Guided Photo-Editing
arXiv:2606.06140v1 Announce Type: new Abstract: Image safety classifiers serve as a critical component of contemporary content moderation systems on the internet. However, their resilience against user-style malicious image editing remains underexplored. Such behaviors are highly prevalent in daily scenarios but difficult to fully reproduce.
Preparing for the Next Carrington: Spatiotemporal Agent-Based Modeling for Safeguarding Satellite Infrastructure Under Extreme Space Weather Disturbances
Announce Type: new Abstract: Extreme space weather poses an existential threat to modern satellite infrastructure, with a Carrington-class solar storm projected to cause economic losses of billions of dollars per day. Due to the rapid proliferation of satellites (with over 70,000 expected to be deployed in the next 5 years), understanding extreme space weather impacts has become essential for global economic stability and national security, and consequently, the lives of millions. However,...