Home › Knowledge Base › Proximal Policy Optimization

Proximal Policy Optimization

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

Variational Proximal Policy Optimization

Announce Type: cross Abstract: Reinforcement Learning from Human Feedback via Proximal Policy Optimization often suffers from policy mode collapse, brittle exploration loops, and distribution drift. This paper introduces Variational Proximal Policy Optimization ($\textsc{VP}_2\textsc{O}$), a particle-based variational inference framework that maps policy optimization to Stein Variational Gradient Descent within a Mixture-of-Experts architecture. By leveraging functional kernels over...

arXiv CS 1d ago

BiasGRPO: Stabilizing Bias Mitigation in High-Variance Reward Landscapes via Group-Relative Policy Optimization

arXiv:2606.04807v1 Announce Type: new Abstract: Mitigating social bias in Large Language Models (LLMs) presents a distinct alignment challenge: unlike verifiable tasks, bias lacks a single ground truth, creating a high-variance, subjective reward landscape. Previous preference-based fine-tuning methods have major trade-offs: Direct Preference Optimization (DPO) is limited by the lack of exploration inherent in offline training, while Proximal Policy Optimization (PPO) can lead to training...

arXiv CS 6d ago

Stabilizing Policy Optimization via Logits Convexity

arXiv:2603.00963v2 Announce Type: replace Abstract: While reinforcement learning (RL) has been central to the recent success of large language models (LLMs), RL optimization is notoriously unstable, especially when compared to supervised fine-tuning (SFT). In this work, we investigate the stability gap between SFT and RL from a gradient-based perspective, and show that the convexity of the SFT loss with respect to model logits plays a key role in enabling stable training. Our theoretical...

arXiv CS 8d ago

FlowPRO: Reward-Free Reinforced Fine-Tuning of Flow-Matching VLAs via Proximalized Preference Optimization

Announce Type: new Abstract: Post-training Vision-Language-Action (VLA) models into policies that can be reliably deployed on real robots remains a major bottleneck. SFT and DAgger exploit failure signals only indirectly, and reward-based RL is bottlenecked by the difficulty of real-world reward design and of training reliable critics. We present FlowPRO, a reward-free offline reinforced fine-tuning framework for flow-matching VLAs.

arXiv CS 5d ago

TT-DAC-PS: Twin-Target Deterministic Actor-Critic with Policy Smoothing for Optimal Trade Execution

arXiv:2606.08379v1 Announce Type: new Abstract: This study addresses the optimal execution of large stock sell programs by introducing TT-DAC-PS (Twin-Target Deterministic Actor-Critic with Policy Smoothing), a deterministic actor-critic architecture that combines twin exponential-moving-average critic targets with pessimistic min backup, TD3-style target policy smoothing noise, delayed actor updates, and conservative Q regularisation to curb overestimation. Exploration uses...

arXiv CS 1d ago

Multi-Objective Reinforcement Learning for Tactical Decision Making for Trucks in Highway Traffic

arXiv:2601.18783v2 Announce Type: replace Abstract: Balancing safety, efficiency, and operational costs in highway driving poses a challenging decision-making problem for heavy-duty vehicles. A central difficulty is that conventional scalar reward formulations, obtained by aggregating these competing objectives, often obscure the structure of their trade-offs. We present a Proximal Policy Optimization based multi-objective reinforcement learning framework that learns a set of policies...

arXiv CS 8d ago

Local Guidance, Global Impact: Gaussian-Reshaped Trust Region Unlocks Behavior Transitions

arXiv:2606.03382v2 Announce Type: replace Abstract: While Proximal Policy Optimization (PPO) demonstrates strong performance in stationary settings, we show that its standard optimization paradigm struggles in continual and non-stationary environments. The failure does not stem from insufficient model capacity or overly restrictive clipping. Instead, PPO performs persistent, directionally inefficient local updates, which indicates a lack of geometry-aware guidance for accumulating meaningful...

arXiv CS 2d ago

Local Guidance, Global Impact: Gaussian-Reshaped Trust Region Unlocks Behavior Transitions

arXiv:2606.03382v1 Announce Type: new Abstract: While Proximal Policy Optimization (PPO) demonstrates strong performance in stationary settings, we show that its standard optimization paradigm struggles in continual and non-stationary environments. The failure does not stem from insufficient model capacity or overly restrictive clipping. Instead, PPO performs persistent, directionally inefficient local updates, which indicates a lack of geometry-aware guidance for accumulating meaningful...

arXiv CS 7d ago

RLEASE: Reinforcement Learning Efficient Active Space Engine

arXiv:2606.07879v1 Announce Type: new Abstract: Selecting the active space for multireference electronic-structure calculations is a long-standing bottleneck that often requires expert chemical intuition and costly trial-and-error. We introduce RLEASE (Reinforcement Learning Efficient Active Space Engine), a low-cost method for automatic, geometry-dependent active-space selection. A neural network predicts per-orbital diagnostic scores ($\hat{s}_{1}$) from inexpensive Hartree-Fock orbital...

arXiv Physics 1d ago

Explainable Data-driven Deep Reinforcement Learning Methods for Optimal Energy Management in Buildings

arXiv:2606.02049v1 Announce Type: new Abstract: The increasing integration of renewable energy sources into power systems, particularly in buildings equipped with photovoltaic (PV) panels and energy storage systems, introduces significant complexity in energy systems. Volatile power generation, varying electricity tariffs, and increased entities, e.g., PV systems, and heat pumps, have increased the complexity and made the system harder to operate. This leads to the demand for additional...

arXiv CS 8d ago