Bayesian RL
No mentions found
This entity hasn't been tracked yet, or Iris is still building its knowledge base.
Related Articles from SNS
Regularized Offline Policy Optimization with Posterior Hybrid Bayesian Belief
arXiv:2606.00680v2 Announce Type: replace Abstract: Offline reinforcement learning (RL) aims to optimize policies from pre-collected datasets. A bottleneck of this paradigm is managing epistemic uncertainty, which arises from limited data coverage (sample-level) and the ambiguity in identifying transition dynamics from finite data (model-level). To provide a unified quantification of these uncertainties, Bayesian RL has been proposed by treating the dynamics model as a random variable and...
Agentic Monte Carlo: Simulating Reinforcement Learning for Black-Box Agents
arXiv:2606.05296v1 Announce Type: new Abstract: LLM agents operate in two distinct regimes: open-weight agents amenable to reinforcement learning (RL) and black-box agents whose behaviour must be controlled purely at test time. Although black-box agents are often backed by state-of-the-art proprietary LLMs, API-only access precludes parameter-level optimization, rendering most RL methods inapplicable. To address this limitation, we turn to a known equivalence between RL and Bayesian inference.
A thalamus–brainstem attractor network drives history-biased decisions
Abstract Natural environments often change gradually, making it adaptive to bias decisions on the basis of the recent past — a phenomenon known as serial dependence1,2,3. Large-scale recordings during behaviour have identified that serial dependence is a common motif for decision-making, with neural representations of past experiences found throughout the brain4,5,6,7,8,9,10,11. However, it remains unclear whether this bias arises from dedicated neural circuits with history-specific...
OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification
arXiv:2606.01476v1 Announce Type: new Abstract: On-Policy Distillation (OPD) trains a student model on its own generative trajectories under dense token-level feedback from a stronger teacher, mitigating both the off-policy distribution shift of Supervised Fine-Tuning (SFT) and the sparse credit assignment of Reinforcement Learning (RL). However, standard OPD faces two coupled limitations. First, it requires direct access to the teacher's token-level logits, excluding a broad class of...
Cross-Epoch Adaptive Rollout Optimization for RL Post-Training
arXiv:2606.05606v1 Announce Type: new Abstract: LLM post-training often relies on reinforcement learning methods that sample multiple rollouts per prompt, yet most existing approaches use a fixed rollout budget for every prompt, despite large differences in the training signal different prompts provide. In this paper, we study adaptive rollout allocation under a fixed global budget and formulate the problem as online resource allocation with prompt-level diminishing returns. Our method,...
Human-Like Neural Nets by Catapulting
Human-like Neural Nets by Catapulting Speculative proposal to create artificial neural nets with human-like performance by high-learning-rate/regularization training of overparameterized NNs to trigger catapulting/grokking. Over-parameterization as a route to true generalization would resolve many outstanding mysteries of artificial versus natural intelligence. There are many mysteries about deep learning and human intelligence, but we could describe the biggest anomaly this way: why are...