VLA Training

Related Articles from SNS

VISTA: Vision-Grounded and Physics-Validated Adaptation of UMI data for VLA Training

arXiv:2606.04708v2 Announce Type: replace Abstract: Universal Manipulation Interface (UMI) enables scalable real-world robot data collection without hardware-specific teleoperation, yet leveraging UMI data to train large-scale Vision-Language-Action (VLA) models remains fundamentally challenging. We identify two critical mismatches: wrist-mounted fisheye views, with severe radial distortion and local gripper-centric perspectives, are out-of-distribution for pretrained VLMs; and...

arXiv CS 5d ago

Beyond Imitation: Reinforcement Learning-Based Sim-Real Co-Training for VLA Models

arXiv:2602.12628v4 Announce Type: replace Abstract: Simulation offers a scalable and low-cost way to enrich vision-language-action (VLA) training, reducing reliance on expensive real-robot demonstrations. However, most sim-real co-training methods rely on supervised fine-tuning (SFT), which treats simulation as a static source of demonstrations and does not exploit large-scale closed-loop interaction. Consequently, real-world gains and generalization are often limited.

arXiv CS 5d ago

TTT-VLA: Test-Time Latent Prompt Optimization for Vision-Language-Action Models

arXiv:2606.03127v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models trained on large-scale data have made remarkable progress, but they remain vulnerable to distribution shifts at deployment time. Recent VLA models suggest that prompts can serve as an efficient interface for steering policy behavior, but existing prompt-based steering typically relies on external guidance. This raises a natural question: can test-time training (TTT) for VLA be achieved by optimizing a prompt,...

arXiv CS 7d ago

LARA: Latent Action Representation Alignment for Vision-Language-Action Models

arXiv:2606.07100v1 Announce Type: new Abstract: Visual-language action (VLA) models enable robots to predict actions directly from observations and language instructions, but their performance depends on large-scale, high-quality data and is limited by the scarcity of real-world robot action datasets. To facilitate VLA model learning with abundant unlabeled human videos, Latent Action Models (LAM) learn latent action representations from visual dynamics to provide additional supervision for...

arXiv CS 2d ago

World2Act: Latent Action Post-Training from World Model Dynamics

arXiv:2603.10422v2 Announce Type: replace Abstract: World Models (WMs) offer a promising mechanism for post-training Vision-Language-Action (VLA) policies by providing dynamics priors that improve generalization under task and scene variation. However, most WM-based post-training methods rely on pixel-space supervision, making policies sensitive to visual artifacts introduced by imperfect WM rollouts. We present World2Act, a latent-space post-training framework that transfers WM dynamics to...

arXiv CS 9d ago

See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs

arXiv:2606.02735v2 Announce Type: replace Abstract: Generalization remains a central bottleneck for vision-language-action (VLA) models: under distractors, appearance shifts, and semantically similar tasks, the policy must often infer local execution details from coarse instructions while also deciding which parts of the image matter for control. We present S2 (See Less, Specify More), a framework for improving VLA generalization by training the executor under a cleaner interface. Specify...

arXiv CS 1d ago

DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation

arXiv:2605.31286v1 Announce Type: new Abstract: Real-world household robots require Vision-Language-Action (VLA) foundation models that can acquire reusable manipulation skills across diverse objects, task conditions, and household environments. Deformable-object folding is a representative challenge, requiring robots to handle clothing items from random initial states across varying categories, geometries, materials, and scenes. However, existing VLA systems commonly train separate policies...

arXiv CS 9d ago

Scaling by Diversified Experience for Vision-Language-Action Models

arXiv:2606.09009v1 Announce Type: new Abstract: Vision-Language-Action models face significant challenges in real-world deployment due to the entanglement of high-level reasoning with low-level control, and the instability of policy optimization. In this paper, we introduce SyVLA, a robust VLA model trained with diversified experiences. We propose an Intention Decoupling algorithm to isolate control-relevant features from reasoning contexts and a similar-sample guided RL pipeline to...

arXiv CS 1d ago