Policy Learning

Related Articles from SNS

Zero-Shot Off-Policy Learning

arXiv:2602.01962v2. Abstract: Off-policy learning methods seek to derive an optimal policy directly from a fixed dataset of prior interactions. This objective presents significant challenges, primarily due to the inherent distributional shift and value function overestimation bias. These issues become even more noticeable in zero-shot reinforcement learning, where an agent trained on reward-free data must adapt to new tasks at test time without additional training.

arXiv CS 8d ago

ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors

arXiv:2603.15956v3. Abstract: Learning generalizable and robust behavior cloning policies requires large volumes of high-quality robotics data. While human demonstrations (e.g., through teleoperation) serve as the standard source for expert behaviors, acquiring such data at scale in the real world is prohibitively expensive. This paper introduces ExpertGen, a framework that automates expert policy learning in simulation to enable scalable sim-to-real transfer.

arXiv CS 8d ago

Off-Policy Learning in Large Action Spaces: Optimization Matters More Than Estimation

arXiv:2509.03456v2. Abstract: Off-policy evaluation (OPE) and off-policy learning (OPL) are foundational for decision-making in offline contextual bandits. Recent advances in OPL primarily optimize OPE estimators with improved statistical properties, assuming that better estimators inherently yield superior policies. Although theoretically justified, this estimator-centric approach neglects a critical practical obstacle: challenging optimization landscapes.

arXiv CS 8d ago

Too Much of a Good Thing: When sim2real Efforts Impede Policy Learning (And What to Do About It)

arXiv:2606.02636v2. Abstract: While sim2real efforts are necessary for effective policy transfer to hardware, there is such a thing as too much of a good thing. We argue that sim2real efforts have led to misaligned incentives with policy learning, resulting in simulator lock in and poor policy exploration due to the unreasonable constraints imposed by the real world. We offer a diagnosis and explanation of the current status of the problem, and propose a potential solution...

arXiv CS 6d ago

DiffAero: A GPU-Accelerated Differentiable Simulation Framework for Efficient Quadrotor Policy Learning

arXiv:2509.10247v1. Abstract: This letter introduces DiffAero, a lightweight, GPU-accelerated, and fully differentiable simulation framework designed for efficient quadrotor control policy learning. DiffAero supports both environment-level and agent-level parallelism and integrates multiple dynamics models, customizable sensor stacks (IMU, depth camera, and LiDAR), and diverse flight tasks within a unified, GPU-native training interface. By fully parallelizing both...

arXiv CS 6d ago

Decision-Focused On-Policy Learning for Contextual Linear Optimization with Partial Feedback

arXiv:2606.01081v1. Abstract: Decision-focused learning (DFL) trains predictive models by optimizing downstream decision quality rather than standalone prediction accuracy. For contextual linear optimization, most existing DFL methods assume offline data and full observations of the objective cost vector. We develop an on-policy learning method for sequential contextual linear optimization under partial feedback, generalizing the standard bandit feedback setting.

arXiv CS 8d ago

Autonomous Obstacle Removal for Excavators through Policy Learning with Particle Simulation

arXiv:2606.09183v1. Abstract: Autonomous obstacle removal from the ground is an important earthwork task, but this is difficult to automate because an excavator must adapt its excavation trajectories over repeated cycles as soil-obstacle conditions change. Learning such state-dependent behavior requires a training environment that reproduces accumulated soil-obstacle interactions, including contact states, terrain deformation, and obstacle visibility. Accordingly,...

arXiv CS 1d ago

CLPO: Curriculum Learning meets Policy Optimization for LLM Reasoning

arXiv:2509.25004v2. Abstract: Online reinforcement learning with verifiable rewards (RLVR) has become an effective paradigm for improving the reasoning abilities of large language models, but most methods still optimize reasoning trajectories over the static problem set, wasting rollout budget on solved or overly difficult problems. We propose CLPO (Curriculum Learning meets Policy Optimization), a self-evolving curriculum framework that uses on-policy rollout...

arXiv CS 1d ago

Dexterity-BEV: Aligning 3D World and Actions for Generalizable Robot Policies Learning

Abstract: End-to-end manipulation policies, combined with web-scale pretrained Vision-Language Models (VLMs), show the promise for generalizable and dexterous robotic manipulation. However, they inherit two key limitations from 2D foundation models: 1) the reliance on 2D RGB inputs that ignores the intrinsically 3D nature of manipulation; and 2) the lack of spatial 3D alignment between input-output spaces as well as across diverse robot embodiments, camera setups, and...

arXiv CS 8d ago