MDP
No mentions found
This entity hasn't been tracked yet, or Iris is still building its knowledge base.
Related Articles from SNS
MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following
arXiv:2606.06058v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards is ideal for multi-constraint instruction following, yet standard group-relative policy optimization (GRPO) becomes unstable under discrete, low-dispersion rewards, where within-group reward distributions are frequently homogeneous. We identify and formalize three pathologies of z-score group normalization in this regime: low-variance amplification, mean-centering blindness, and zero-variance...
The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective
arXiv:2606.07017v1 Announce Type: new Abstract: Foundation model agents are increasingly deployed for real-world decision-making, but suffer from the sim-to-real gap. While robotics and classical control have mature frameworks to address this gap, the foundation model community is treating agent robustness as an entirely novel phenomenon. Our paper proposes formalizing the foundation model agent evaluation and training gap as a classical sim-to-real problem structured entirely around the...
Bayesian learning for the stochastic shortest path problem
Announce Type: cross Abstract: Sequential decision-making problems are often modelled as a Markov decision process (MDP). We focus on the stochastic shortest path (SSP) problem, which is an infinite-horizon undiscounted MDP with absorbing terminal states. We develop a Bayesian framework to learn the optimal decision strategy through interactions with the decision-making task.
Model-Based Learning of Whittle indices
Announce Type: replace Abstract: We present BLINQ, a new model-based algorithm that learns the Whittle indices of an indexable, communicating and unichain Markov Decision Process (MDP). Our approach relies on building an empirical estimate of the MDP and then computing its Whittle indices using an extended version of a state-of-the-art existing algorithm. We provide a proof of convergence to the Whittle indices we want to learn as well as a bound on the time needed to learn them with...
Small RL Controller, Large Language Model: RL-Guided Adaptive Sampling for Test-Time Scaling
Announce Type: new Abstract: Test-time scaling improves the reasoning performance of large language models but incurs substantial cost in both total computation and latency. Existing adaptive sampling methods partially mitigate this issue by dynamically deciding when to stop sampling, yet they typically rely on heuristic rules or rely on distribution assumptions. In this work, we formulate adaptive sampling as a Markov decision process (MDP).
Data- and Variance-dependent Regret Bounds for Online Tabular MDPs
arXiv:2602.01903v2 Announce Type: replace Abstract: This work studies online episodic tabular Markov decision processes (MDPs) with known transitions and develops best-of-both-worlds algorithms that achieve refined data-dependent regret bounds in the adversarial regime and variance-dependent regret bounds in the stochastic regime. We quantify MDP complexity using a first-order quantity and several new data-dependent measures for the adversarial regime, including a second-order quantity and a...
Task-Induced Representational Invariances Depend on Learning Objective in Deep RL
arXiv:2606.01868v1 Announce Type: new Abstract: Reinforcement Learning (RL) has long served as a model for goal-directed animal behavior in neuroscience. Modern deep RL has shown remarkable success across many domains, further strengthening this connection.
StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning
arXiv:2604.18401v3 Announce Type: replace Abstract: Agentic reinforcement learning (RL) is emerging as a critical post-training paradigm for improving LLM agent capabilities. Existing RL algorithms for LLMs largely follow the token-centric paradigm as in RLHF and RLVR, where tokens serve as the basic units for modeling and optimization. However, this paradigm introduces a granularity mismatch in agentic RL, as it optimizes token-level predictions while LLM agents make step-level decisions...
StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning
Announce Type: replace Abstract: Agentic reinforcement learning (RL) is emerging as a critical post-training paradigm for improving LLM agent capabilities. Existing RL algorithms for LLMs largely follow the token-centric paradigm as in RLHF and RLVR, where tokens serve as the basic units for modeling and optimization. However, this paradigm introduces a granularity mismatch in agentic RL, as it optimizes token-level predictions while LLM agents make step-level decisions through cycles of...
Robust Restless Multi-Armed Bandit for Data Center Flexibility Services Through Virtual Machine Scheduling
arXiv:2605.19116v2 Announce Type: replace Abstract: Energy demands from data centers have surged and stressed the grid in recent years. Electric grids require balancing supply and demand every second, motivating demand response (reduction) from large loads, including data centers. This can be achieved by rescheduling jobs on a physical machine.