StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception

arXiv CS, Friday 05 June 2026, 04:00 UTC. By Evans Han, Yunfan Jiang, Yingke Wang, Haoyue Xiao, Huang Huang, Jianwen Xie, Jiajun Wu, Li Fei-Fei, Ruohan Zhang

Key Points

arXiv:2605.09989v2. Abstract: Recent advances in robot imitation learning have produced powerful visuomotor policies that manipulate diverse objects from visual inputs. However, monocular observations lack depth information, which is critical for precise manipulation in cluttered or geometrically complex scenes. Explicit depth maps and point clouds are often noisy and fragile in real-world manipulation. We introduce StereoPolicy, a visuomotor policy learning framework that directly leverages synchronized stereo image pairs to improve geometric reasoning without constructing explicit 3D representations. StereoPolicy processes each image with pretrained 2D vision encoders and fuses left-right features through a cross-attention-based Stereo Transformer, capturing spatial correspondence and disparity cues implicitly. The framework integrates with diffusion-based and pretrained vision-language-action (VLA) policies, delivering consistent improvements over RGB, RGB-D, point cloud, and multi-view baselines across three simulation benchmarks and seven real-robot tabletop and bimanual mobile manipulation tasks. Our results show that stereo vision bridges 2D pretrained representations and 3D geometric understanding for robotic manipulation.
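To make the left-right fusion step concrete, here is a minimal sketch of single-head cross-attention between two sets of encoder tokens, written in plain Python. It is not the authors' Stereo Transformer: the abstract only states that left and right features are fused through cross-attention, so the function names (`cross_attention`, `fuse_stereo`), the bidirectional attention, the residual connection, and the single-head, single-layer design are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def cross_attention(queries, context, w_q, w_k, w_v):
    """Each query token attends over all context tokens (single head)."""
    q = matmul(queries, w_q)
    k = matmul(context, w_k)
    v = matmul(context, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    attended = []
    for q_row in q:
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        weights = softmax(scores)
        attended.append(
            [sum(w * v_row[j] for w, v_row in zip(weights, v)) for j in range(len(v[0]))]
        )
    return attended


def fuse_stereo(left_tokens, right_tokens, params):
    """Hypothetical fusion: left attends to right, right attends to left.

    `params` maps "left" and "right" to (w_q, w_k, w_v) tuples. The residual
    addition is an assumption, not something stated in the abstract.
    """
    left_ctx = cross_attention(left_tokens, right_tokens, *params["left"])
    right_ctx = cross_attention(right_tokens, left_tokens, *params["right"])
    fused_left = [[a + b for a, b in zip(t, c)] for t, c in zip(left_tokens, left_ctx)]
    fused_right = [[a + b for a, b in zip(t, c)] for t, c in zip(right_tokens, right_ctx)]
    # Concatenate along the token axis so a downstream policy sees both views.
    return fused_left + fused_right


# Toy usage: two left tokens and three right tokens of dimension 2.
identity = [[1.0, 0.0], [0.0, 1.0]]
left = [[1.0, 0.0], [0.0, 1.0]]
right = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
weights = {"left": (identity, identity, identity), "right": (identity, identity, identity)}
fused = fuse_stereo(left, right, weights)
```

In this sketch the attention weights between a left token and the right tokens play the role of a soft correspondence, which is one way disparity cues could be captured without an explicit depth map. A practical version would use learned projections, multiple heads, and positional encodings on top of the pretrained 2D encoder features.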


Originally published by arXiv CS.

