Home › Knowledge Base › Contrastive Reinforcement Learning

Contrastive Reinforcement Learning

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

Autonomous Aerial Manipulation via Contextual Contrastive Meta Reinforcement Learning

arXiv:2606.08533v1 Announce Type: new Abstract: Unmanned aerial vehicles (UAVs) are increasingly being deployed in logistics, service robotics, and other real-world applications, creating a growing demand for autonomous payload acquisition and delivery. Existing approaches typically assume pre-attached payloads or rely on specialized grippers, leaving versatile end-to-end aerial delivery largely unresolved, where different payloads induce highly variable flight dynamics, requiring a single...

arXiv CS 1d ago

SVL: Goal-Conditioned Reinforcement Learning as Survival Learning

arXiv:2604.17551v2 Announce Type: replace Abstract: Standard approaches to goal-conditioned reinforcement learning (GCRL) that rely on temporal-difference learning can be unstable and sample-inefficient due to bootstrapping. While recent work has explored contrastive and supervised formulations to improve stability, we present a probabilistic alternative, called survival value learning (SVL), that reframes GCRL as a survival learning problem by modeling the time-to-goal from each state as a...

arXiv CS 9d ago

BranPO: Scalable Contrastive Branch Sampling for Long-Horizon Agentic Reinforcement Learning

Announce Type: replace Abstract: Agentic reinforcement learning enables large language models to perform multi-turn planning and tool use, but long-horizon training remains challenging under sparse trajectory-level rewards, where a single outcome is uniformly assigned to all decisions. Prior methods introduce finer-grained supervision via tree-based exploration or process-level evaluation, but often incur high cost or produce noisy credit signals. In agentic trajectories, early mistakes may...

arXiv CS 8d ago

Survival Reinforcement Learning: Toward Scalable Self-Supervised RL

arXiv:2605.31273v1 Announce Type: new Abstract: While self-supervised Contrastive Reinforcement Learning (CRL) has shown remarkable depth-scaling capabilities, successfully using networks over 64 layers, scaled CRL still struggles with long-horizon goal-conditioned planning due to the uniformity-tolerance dilemma inherent in contrastive losses. We introduce Survival Reinforcement Learning (SRL), an online classification-based alternative that extends the survival value learning framework by...

arXiv CS 9d ago

Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective

arXiv:2605.12969v3 Announce Type: replace Abstract: Group Relative Policy Optimization (GRPO) is one of the most widely adopted RLVR algorithms for post-training large language models on reasoning tasks. We first show that GRPO admits an equivalent discriminative reformulation, in which policy optimization maximizes the expected score gap between verified positive and negative rollouts. This reformulation reveals two objective-level limitations: likelihood-misaligned surrogate scores, in...

arXiv CS 8d ago

From Reward-Free Representations to Preferences: Rethinking Offline Preference-Based Reinforcement Learning

Announce Type: new Abstract: Preference-based reinforcement learning (PbRL) avoids explicit reward engineering by learning from pairwise human preference feedback. Existing offline PbRL methods typically follow a two-stage pipeline, first learning a reward or preference model from labeled preferences and then performing offline RL on unlabeled data. We revisit offline PbRL through the lens of reward-free representation learning (RFRL) from the zero-shot RL literature, and propose a new...

arXiv CS 8d ago

Robust Scene Transfer for PointGoal Navigation via Privileged Sensor Guided Contrastive Learning

arXiv:2606.05506v1 Announce Type: new Abstract: We propose a sensor-guided adaptive contrastive learning framework for visual representation learning in PointGoal navigation. During training, privileged LiDAR sensing guides the contrastive objective through a geometry-aware similarity metric and adaptive temperature scaling, encouraging visual embeddings to capture navigation-relevant structure rather than scene-specific appearance. The resulting encoder is pretrained independently, frozen,...

arXiv CS 5d ago

Episodic Memory Temporal Consistency for Cooperative Multi-Agent Reinforcement Learning

arXiv:2606.04492v1 Announce Type: new Abstract: Cooperative Multi-Agent Reinforcement Learning (MARL) frequently suffers from severe reward sparsity and exploration bottlenecks. While episodic memory mechanisms mitigate these issues by reusing high-return trajectories, they often trap agents in local optima due to unconstrained incentive distribution and semantic representation collapse. To address this, we propose Episodic Memory Temporal Consistency (EMTC), a framework that robustly...

arXiv CS 6d ago

Improving Small Language Models for Code Generation with Reinforcement Learning from Verification Feedback

arXiv:2605.30478v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) trains language models using programmatically checkable signals such as unit-test outcomes, enabling direct optimization for functional correctness in code generation. We conduct an empirical study of RLVR for Python code generation on the MBPP benchmark using two small models (Qwen3-0.6B and Llama3.2-1B) with LoRA fine-tuning. Across multiple reward formulations such as: unit-test-only...

arXiv CS 9d ago

RLVE: Scaling Up Reinforcement Learning for Language Models with Adaptive Verifiable Environments

arXiv:2511.07317v2 Announce Type: replace Abstract: We introduce Reinforcement Learning (RL) with Adaptive Verifiable Environments (RLVE), an approach using verifiable environments that procedurally generate problems and provide algorithmically verifiable rewards, to scale up RL for language models (LMs). RLVE enables each verifiable environment to dynamically adapt its problem difficulty distribution to the policy model's capabilities as training progresses. In contrast, static data...

arXiv CS 1d ago