
RoboCasa

Related Articles from SNS

SeeTraceAct: Visibility-Aware Latent Planning from Cross-Embodiment Demonstration Videos

Abstract: Vision-language-action models (VLAs) are promising general-purpose robot policies, but adapting them to new tasks typically requires costly task-specific teleoperation data. As an alternative, we study one-shot demo-conditioned VLAs, where a robot policy is conditioned on a single demonstration video of an unseen task. We find that existing end-to-end approaches often struggle when successful execution requires precisely localizing small target regions.

arXiv CS 7d ago

What Are We Actually Benchmarking in Robot Manipulation?

arXiv:2606.04233v1. Abstract: A robotics benchmark score measures success under one fixed evaluation setup, yet is routinely treated as evidence of general manipulation capability. We identify four failure modes, each of which weakens or invalidates a benchmark's role as a valid proxy for that capability: shortcut solvability, lack of statistical significance, creeping overfitting, and data-source dependence.

arXiv CS 6d ago

Robot-DIFT: Correspondence-Sensitive Diffusion Features for Contact-Rich Robot Manipulation

arXiv:2602.11934v2. Abstract: Robot manipulation often fails in the final millimeters: a policy may recognize the right object yet miss the pose offsets, boundaries, or pre-contact alignments needed for action. We argue that such failures arise when semantic invariance suppresses correspondence cues for closed-loop control, or when these cues are not exposed to the policy in a usable form. Modern visual encoders provide strong semantic abstractions, but contact-rich...

arXiv CS 1d ago

SymSkill: Symbol and Skill Co-Invention for Data-Efficient and Reactive Long-Horizon Manipulation

arXiv:2510.01661v3. Abstract: Multi-step manipulation in dynamic environments remains challenging. Imitation learning (IL) is reactive but lacks compositional generalization, since monolithic policies do not decide which skill to reuse when scenes change. Classical task-and-motion planning (TAMP) offers compositionality, but its high planning latency prevents real-time failure recovery.

arXiv CS 1d ago

Contrastive Representation Regularization for Vision-Language-Action Models

Abstract: Vision-Language-Action (VLA) models have shown strong capabilities in robot manipulation by leveraging rich representations from pre-trained Vision-Language Models (VLMs). However, their representations arguably remain suboptimal, lacking sensitivity to robotic signals such as control actions and proprioceptive information. To address the issue, we introduce Robot State-aware Contrastive Loss (RS-CL), a simple and effective representation regularization for...

arXiv CS 8d ago

World2Act: Latent Action Post-Training from World Model Dynamics

arXiv:2603.10422v2. Abstract: World Models (WMs) offer a promising mechanism for post-training Vision-Language-Action (VLA) policies by providing dynamics priors that improve generalization under task and scene variation. However, most WM-based post-training methods rely on pixel-space supervision, making policies sensitive to visual artifacts introduced by imperfect WM rollouts. We present World2Act, a latent-space post-training framework that transfers WM dynamics to...

arXiv CS 9d ago

CAPE: Contrastive Action-conditioned Parallel Encoding for Embodied Planning

Abstract: Embodied agents need to predict the future consequences of candidate actions in order to plan effectively before execution. Existing visual dynamics models learn by reconstructing future visual states or rolling out dense latent representations, which spreads learning capacity across visually salient but planning-irrelevant content rather than the action-conditioned changes that drive manipulation outcomes. We propose CAPE, a Contrastive Action-conditioned...

arXiv CS 2d ago