FlowPRO: Reward-Free Reinforced Fine-Tuning of Flow-Matching VLAs via Proximalized Preference Optimization

arXiv CS, Friday 05 June 2026, 04:00 UTC. By Yihao Wu, He Zhang, Junbo Tan, Xueqian Wang, Zhengyou Zhang


arXiv:2606.05468v1

Abstract: Post-training Vision-Language-Action (VLA) models into policies that can be reliably deployed on real robots remains a major bottleneck. SFT and DAgger exploit failure signals only indirectly, and reward-based RL is bottlenecked by the difficulty of real-world reward design and of training reliable critics. We present FlowPRO, a reward-free offline reinforced fine-tuning framework for flow-matching VLAs. Algorithmically, we propose RPRO (Robotic Flow-matching Proximalized Preference Optimization), a preference-optimization objective tailored to the flow-matching action head of VLA models. RPRO pairs a contrastive optimizer with an explicit proximal regularizer that anchors the absolute magnitude of the implicit reward, thereby eliminating the reward-hacking failure mode of plain Flow-DPO. On the data side, a teleoperated intervention-and-rollback paradigm produces naturally paired positive and negative trajectories $(\tau^w, \tau^l)$ on a real robot from a single operator action; a Smooth Interpolation procedure, combined with batch mixing, then converts these sparse corrections into dense per-state supervision while preserving the base policy's capabilities. On four long-horizon bimanual tasks, FlowPRO attains the highest success rate, outperforming four representative baselines, and ablations confirm the contribution of each loss component.
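The abstract does not give the RPRO objective, so the following is only an illustrative sketch of the general idea it describes: a contrastive preference term over a winning and a losing action, plus a proximal term that anchors the absolute size of the implicit reward. Everything concrete here is an assumption and not the authors' formulation: the implicit reward is taken to be the negative flow-matching error gap against a frozen reference policy (a common Diffusion-DPO-style choice), the proximal term is a plain squared penalty, the target velocity uses the `action - noise` convention, and the names `preference_loss_sketch`, `beta`, and `lam` are hypothetical.

```python
import numpy as np


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)


def flow_matching_error(v_pred, action, noise):
    # Squared error against an assumed straight-line target velocity
    # (action - noise); other flow-matching conventions flip the sign.
    target = action - noise
    return np.sum((v_pred - target) ** 2, axis=-1)


def preference_loss_sketch(err_w, err_l, ref_err_w, ref_err_l, beta=1.0, lam=0.1):
    """Hypothetical contrastive + proximal loss (not the paper's RPRO).

    err_w, err_l:         policy flow-matching errors on the preferred (w)
                          and rejected (l) actions, one value per sample.
    ref_err_w, ref_err_l: the same errors under a frozen reference policy.
    """
    # Assumed implicit rewards: positive when the policy fits the action
    # better than the reference does.
    r_w = -beta * (err_w - ref_err_w)
    r_l = -beta * (err_l - ref_err_l)

    # Contrastive term: depends only on the margin r_w - r_l.
    contrastive = -log_sigmoid(r_w - r_l)

    # Assumed proximal term: penalizes the absolute size of each implicit
    # reward, so a shared drift of both samples is no longer free.
    proximal = lam * (r_w ** 2 + r_l ** 2)

    return (
        float(np.mean(contrastive + proximal)),
        float(np.mean(contrastive)),
        float(np.mean(proximal)),
    )


if __name__ == "__main__":
    ref = np.array([1.0, 2.0])
    # Both errors drift upward by the same amount: the margin is unchanged,
    # so only the proximal term registers the drift.
    print(preference_loss_sketch(ref + 5.0, ref + 5.0, ref, ref))
```

In this toy form, a relative-only objective cannot distinguish "both samples unchanged" from "both samples degraded equally", because only the margin enters the contrastive term. The added penalty is one simple way to make that distinction visible; whether it matches the regularizer used in the paper cannot be determined from the abstract.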


Originally published by arXiv CS.
