MulFeRL: Enhancing Reinforcement Learning with Verbal Feedback in a Multi-turn Loop

arXiv CS, Tuesday 02 June 2026, 04:00 UTC. By Xuancheng Li, Haitao Li, Yujia Zhou, Yiqun Liu, Qingyao Ai

Key Points

arXiv:2601.22900v2. Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is widely used to improve reasoning across domains, but outcome-only scalar rewards are often sparse and uninformative. This limitation is especially severe for failed samples, where scalar rewards indicate only that a solution is incorrect without explaining why the reasoning breaks down. In this paper, we leverage richer verbal feedback to guide RLVR on failed samples and convert feedback-induced progress into trainable learning signals. We propose MulFeRL (Multi-turn Feedback-guided Reinforcement Learning), a multi-turn, event-triggered RLVR framework that combines progress induction for feedback-guided regeneration of failed samples, progress credit assignment for learning from verifier-confirmed progress, and structured feedback injection for integrating feedback into the model's reasoning process. Trained on sampled OpenR1-Math, MulFeRL outperforms supervised, self-distillation-based, and RLVR baselines in-domain, while also showing strong out-of-domain generalization.
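
The abstract describes the loop only at a high level, so the following is a minimal Python sketch of the general idea rather than the authors' implementation. It assumes a failure-triggered loop in which a verifier checks each attempt, verbal feedback is requested only after a failure, and the feedback is injected into the next prompt under labeled sections. Every name here (`Turn`, `feedback_loop`, `progress_signals`, the prompt layout, and the reward-difference credit rule) is hypothetical and chosen for illustration; the paper's actual credit assignment and injection format may differ.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    prompt: str
    answer: str
    reward: float  # verifier outcome: 1.0 if correct, 0.0 otherwise


def feedback_loop(
    question: str,
    generate: Callable[[str], str],
    verify: Callable[[str], bool],
    give_feedback: Callable[[str, str], str],
    max_turns: int = 3,
) -> List[Turn]:
    """Run one episode; feedback is requested only after a failed attempt."""
    turns: List[Turn] = []
    prompt = question
    for _ in range(max_turns):
        answer = generate(prompt)
        reward = 1.0 if verify(answer) else 0.0
        turns.append(Turn(prompt, answer, reward))
        if reward == 1.0:
            break  # success: no feedback turn is triggered
        feedback = give_feedback(question, answer)
        # Assumed injection format: previous attempt and feedback as labeled sections.
        prompt = (
            f"{question}\n[Previous attempt]\n{answer}\n[Feedback]\n{feedback}"
        )
    return turns


def progress_signals(turns: List[Turn]) -> List[float]:
    """Assumed credit rule: each turn is credited with its change in verifier reward."""
    signals: List[float] = []
    previous = 0.0
    for turn in turns:
        signals.append(turn.reward - previous)
        previous = turn.reward
    return signals


# Toy stand-ins for the policy, verifier, and feedback source (all hypothetical).
def toy_generate(prompt: str) -> str:
    return "4" if "[Feedback]" in prompt else "5"


def toy_verify(answer: str) -> bool:
    return answer == "4"


def toy_feedback(question: str, answer: str) -> str:
    return "Recheck the addition."


turns = feedback_loop("What is 2 + 2?", toy_generate, toy_verify, toy_feedback)
signals = progress_signals(turns)
```

In this toy run the first attempt fails, feedback is injected, and the second attempt passes the verifier, so only the regenerated turn receives positive credit. A real training setup would feed such per-turn signals into a policy-gradient update, which is outside the scope of this sketch.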

Originally published by arXiv CS.
