Home Knowledge Base MDP

MDP

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following

arXiv:2606.06058v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards is ideal for multi-constraint instruction following, yet standard group-relative policy optimization (GRPO) becomes unstable under discrete, low-dispersion rewards, where within-group reward distributions are frequently homogeneous. We identify and formalize three pathologies of z-score group normalization in this regime: low-variance amplification, mean-centering blindness, and zero-variance...

arXiv CS 5d ago

The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective

arXiv:2606.07017v1 Announce Type: new Abstract: Foundation model agents are increasingly deployed for real-world decision-making, but suffer from the sim-to-real gap. While robotics and classical control have mature frameworks to address this gap, the foundation model community is treating agent robustness as an entirely novel phenomenon. Our paper proposes formalizing the foundation model agent evaluation and training gap as a classical sim-to-real problem structured entirely around the...

arXiv CS 2d ago

Bayesian learning for the stochastic shortest path problem

Announce Type: cross Abstract: Sequential decision-making problems are often modelled as a Markov decision process (MDP). We focus on the stochastic shortest path (SSP) problem, which is an infinite-horizon undiscounted MDP with absorbing terminal states. We develop a Bayesian framework to learn the optimal decision strategy through interactions with the decision-making task.

arXiv CS 6d ago

Model-Based Learning of Whittle indices

Announce Type: replace Abstract: We present BLINQ, a new model-based algorithm that learns the Whittle indices of an indexable, communicating and unichain Markov Decision Process (MDP). Our approach relies on building an empirical estimate of the MDP and then computing its Whittle indices using an extended version of a state-of-the-art existing algorithm. We provide a proof of convergence to the Whittle indices we want to learn as well as a bound on the time needed to learn them with...

arXiv CS 1d ago

Small RL Controller, Large Language Model: RL-Guided Adaptive Sampling for Test-Time Scaling

Announce Type: new Abstract: Test-time scaling improves the reasoning performance of large language models but incurs substantial cost in both total computation and latency. Existing adaptive sampling methods partially mitigate this issue by dynamically deciding when to stop sampling, yet they typically rely on heuristic rules or rely on distribution assumptions. In this work, we formulate adaptive sampling as a Markov decision process (MDP).

arXiv CS 7d ago

Data- and Variance-dependent Regret Bounds for Online Tabular MDPs

arXiv:2602.01903v2 Announce Type: replace Abstract: This work studies online episodic tabular Markov decision processes (MDPs) with known transitions and develops best-of-both-worlds algorithms that achieve refined data-dependent regret bounds in the adversarial regime and variance-dependent regret bounds in the stochastic regime. We quantify MDP complexity using a first-order quantity and several new data-dependent measures for the adversarial regime, including a second-order quantity and a...

arXiv CS 7d ago

Task-Induced Representational Invariances Depend on Learning Objective in Deep RL

arXiv:2606.01868v1 Announce Type: new Abstract: Reinforcement Learning (RL) has long served as a model for goal-directed animal behavior in neuroscience. Modern deep RL has shown remarkable success across many domains, further strengthening this connection.

arXiv CS 8d ago

StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

arXiv:2604.18401v3 Announce Type: replace Abstract: Agentic reinforcement learning (RL) is emerging as a critical post-training paradigm for improving LLM agent capabilities. Existing RL algorithms for LLMs largely follow the token-centric paradigm as in RLHF and RLVR, where tokens serve as the basic units for modeling and optimization. However, this paradigm introduces a granularity mismatch in agentic RL, as it optimizes token-level predictions while LLM agents make step-level decisions...

arXiv CS 2d ago

StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

Announce Type: replace Abstract: Agentic reinforcement learning (RL) is emerging as a critical post-training paradigm for improving LLM agent capabilities. Existing RL algorithms for LLMs largely follow the token-centric paradigm as in RLHF and RLVR, where tokens serve as the basic units for modeling and optimization. However, this paradigm introduces a granularity mismatch in agentic RL, as it optimizes token-level predictions while LLM agents make step-level decisions through cycles of...

arXiv CS 8d ago

Robust Restless Multi-Armed Bandit for Data Center Flexibility Services Through Virtual Machine Scheduling

arXiv:2605.19116v2 Announce Type: replace Abstract: Energy demands from data centers have surged and stressed the grid in recent years. Electric grids require balancing supply and demand every second, motivating demand response (reduction) from large loads, including data centers. This can be achieved by rescheduling jobs on a physical machine.

arXiv CS 2d ago