Policy Smoothing

Related Articles from SNS

TT-DAC-PS: Twin-Target Deterministic Actor-Critic with Policy Smoothing for Optimal Trade Execution

arXiv:2606.08379v1 Announce Type: new Abstract: This study addresses the optimal execution of large stock sell programs by introducing TT-DAC-PS (Twin-Target Deterministic Actor-Critic with Policy Smoothing), a deterministic actor-critic architecture that combines twin exponential-moving-average critic targets with pessimistic min backup, TD3-style target policy smoothing noise, delayed actor updates, and conservative Q regularisation to curb overestimation. Exploration uses...

arXiv CS 1d ago
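Two of the ingredients named in the abstract above, TD3-style target policy smoothing noise and a pessimistic min backup over twin critics, can be sketched in a few lines. This is a minimal illustration of the generic TD3 recipe, not the TT-DAC-PS implementation: the function names, the noise scale `sigma`, the clip range `noise_clip`, and the action bounds are all assumptions chosen for the example, and the exponential-moving-average target updates and conservative Q regularisation mentioned in the abstract are not shown.

```python
import random


def clip(x, lo, hi):
    """Clamp a scalar to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def smoothed_target_action(mu_next, sigma=0.2, noise_clip=0.5,
                           act_low=-1.0, act_high=1.0, rng=random):
    """Target policy smoothing: perturb the target actor's action with
    clipped Gaussian noise, then clamp to the valid action range.

    mu_next is the target actor's output for the next state, given as a
    list of floats. All default values are illustrative assumptions.
    """
    smoothed = []
    for a in mu_next:
        eps = clip(rng.gauss(0.0, sigma), -noise_clip, noise_clip)
        smoothed.append(clip(a + eps, act_low, act_high))
    return smoothed


def pessimistic_td_target(reward, done, q1_next, q2_next, gamma=0.99):
    """Min backup: bootstrap from the smaller of the two target-critic
    estimates, which counteracts overestimation."""
    not_done = 0.0 if done else 1.0
    return reward + gamma * not_done * min(q1_next, q2_next)


if __name__ == "__main__":
    rng = random.Random(0)
    a_next = smoothed_target_action([0.9, -0.95, 0.0], rng=rng)
    # q1_next and q2_next would come from the two target critics evaluated
    # at (next_state, a_next); fixed numbers stand in for them here.
    y = pessimistic_td_target(reward=1.0, done=False, q1_next=2.0, q2_next=3.0)
    print(a_next, y)
```

The noise is clipped before it is added so that the smoothed action stays close to the target actor's output, and the result is clamped a second time so it remains a valid action.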

ZAPS-DA: Zero-Phase Action Policy Smoothing with Decoupled Actor for Continuous Control in Reinforcement Learning

arXiv:2605.30612v1 Announce Type: new Abstract: Continuous control policies trained with off-policy reinforcement learning frequently exhibit high-frequency action jitter, rendering direct deployment on physical actuators impractical. Post-hoc filtering attenuates jitter but introduces phase lag; embedding smoothness penalties in the actor's loss couples them with the RL gradient and conflates reward regression with over-aggressive smoothing. We present ZAPS-DA, a framework that reduces...

arXiv CS 9d ago

Rethinking the Divergence Regularization in LLM RL

Announce Type: new Abstract: Reinforcement learning (RL) has become a key component of post-training large language models (LLMs). In practice, LLM RL is often off-policy because of training-inference mismatch and policy staleness, making trust-region control essential for stable optimization. Mainstream methods such as PPO and GRPO approximate this control with a ratio-clipping mechanism, but the importance ratio can be a poor proxy for distributional shift in long-tailed vocabularies.

arXiv CS 1d ago
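For readers unfamiliar with the ratio-clipping mechanism the abstract above refers to, the standard PPO clipped surrogate for a single sample can be sketched as follows. This shows the baseline mechanism only, not the method proposed in the paper; the function name and the default `clip_eps` are assumptions made for the example.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style clipped objective for one sample (to be maximised).

    The importance ratio pi_new / pi_old is computed from log-probabilities,
    clamped to [1 - clip_eps, 1 + clip_eps], and the more pessimistic of the
    clipped and unclipped terms is kept.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage the objective stops rewarding ratios above `1 + clip_eps`; with a negative advantage the unclipped term is kept when it is the smaller one, so the penalty is never softened by clipping.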

Acquiring Human-Like Data-Efficient Mechanics Prediction from Deep Reinforcement Learning

Announce Type: replace Abstract: Humans can infer mechanical outcomes by learning from a few observations. This capacity for mechanics intuition is acquired in a data-efficient manner. Here, we propose a reinforcement learning framework to mimic this process, in which an agent encodes continuous physical observation parameters into its state and is trained via episodic switching across closely related observations.

arXiv Physics 1d ago

Hybrid TD3: Overestimation Bias Analysis and Stable Policy Optimization for Hybrid Action Space

Announce Type: replace Abstract: Reinforcement learning in discrete-continuous hybrid action spaces presents fundamental challenges for robotic manipulation, where high-level task decisions and low-level joint-space execution must be jointly optimized. Existing approaches either discretize continuous components or relax discrete choices into continuous approximations, which suffer from scalability limitations and training instability in high-dimensional action spaces and under domain...

arXiv CS 8d ago

Toward accurate RUL and SoH estimation using reinforced graph-based physics-informed neural networks enhanced with dynamic weights

arXiv:2507.09766v2 Announce Type: replace Abstract: Accurate estimation of Remaining Useful Life (RUL) and State of Health (SoH) is essential for reliable Prognostics and Health Management (PHM), supporting timely maintenance and dependable industrial operation. However, hybrid models that combine data-driven learning with physics-based regularization often rely on fixed loss weights and therefore lose accuracy when transferred across assets with different degradation behaviors. This study...

arXiv CS 8d ago

Deep reinforcement learning with spatial and temporal awareness for active boundary control of buoyancy-driven convection

arXiv:2606.06191v1 Announce Type: new Abstract: Deep reinforcement learning (DRL) applied to thermal convection control consistently produces "degenerate actuation": wall-temperature policies whose outputs are saturated, pseudo-random, or spatially incoherent. Two compounding deficiencies are responsible: multilayer-perceptron policies that discard spatial flow structure, and memoryless policies that cannot distinguish self-induced flow changes from background evolution. Together they...

arXiv Physics 5d ago

TACO: General Acrobatic Flight Control via Target-and-Command-Oriented Reinforcement Learning

arXiv:2503.01125v5 Announce Type: replace Abstract: Although acrobatic flight control has been studied extensively, one key limitation of the existing methods is that they are usually restricted to specific maneuver tasks and cannot change flight pattern parameters online. In this work, we propose a target-and-command-oriented reinforcement learning (TACO) framework, which can handle different maneuver tasks in a unified way and allows online parameter changes. Additionally, we propose a...

arXiv CS 1d ago

T-GMP: Terrain-conditioned Generative Motion Priors for Versatile and Natural Humanoid Locomotion

arXiv:2606.06944v1 Announce Type: new Abstract: Achieving both anthropomorphic naturalness and robust terrain traversal remains a fundamental challenge in humanoid locomotion. Existing Reinforcement Learning (RL) approaches typically rely on fixed motion priors, limiting their adaptability to varying environments. We propose Terrain-conditioned Generative Motion Priors (T-GMP), a module that captures a terrain-conditioned latent motion manifold from a few expert state-terrain demonstrations...

arXiv CS 2d ago

WAM-Nav: Asymmetric Latent World-Action Modeling for Unified Visual Navigation

arXiv:2606.04907v1 Announce Type: new Abstract: Visual navigation requires generating smooth and collision-free trajectories under complex geometric and physical constraints. Existing reactive policies that directly map observations to actions lack anticipatory reasoning, limiting their ability to proactively avoid obstacles. While visual imagination offers predictive foresight, conventional modular approaches separate scene prediction from policy learning, often leading to error...

arXiv CS 6d ago