AgroOmni: A Large-Scale Multi-view Agricultural Dataset for Cross-Scale Multimodal Reasoning

arXiv CS Tuesday 09 June 2026, 04:00 UTC By Jiarui Zhang, Junqi Hu, Zurong Mai, Yang Liu, Yuhang Chen, Shuohong Lou, Henglian Huang, Hong Cheng, Lingyuan Zhao, Jianxi Huang, Yutong Lu, Haohuan Fu, Juepeng Zheng 1 min read

Key Points

Announce Type: replace Abstract: Modern agricultural data is sourced from diverse platforms and spans multiple spatial scales, ranging from ground-level close-up photography to Unmanned Aerial Vehicle (UAV) aerial observation and satellite remote sensing imagery. Accordingly, agricultural multimodal reasoning demands robust cross-scale spatial understanding. However, due to the lack of multi-view agricultural benchmark datasets, existing multimodal large language models (MLLMs) exhibit...

arXiv:2603.14342v2 Announce Type: replace Abstract: Modern agricultural data is sourced from diverse platforms and spans multiple spatial scales, ranging from ground-level close-up photography to Unmanned Aerial Vehicle (UAV) aerial observation and satellite remote sensing imagery. Accordingly, agricultural multimodal reasoning demands robust cross-scale spatial understanding. However, due to the lack of multi-view agricultural benchmark datasets, existing multimodal large language models (MLLMs) exhibit severe ground-level bias, which leads to scale confusion then semantic collapse in agricultural perception tasks, such as misinterpreting farmland imagery as walls or floors. To address this, we introduce AgroOmni, a large-scale multi-view training corpus with 288K Visual Question Answering pairs covering 56 specialized task categories across 14 task types, designed to capture diverse scales in modern precision agriculture. Built on this dataset, we propose AgroNVILA, which achieves a new state-of-the-art of 62.32% on the AgroMind benchmark (+15.03% over GPT-5.2), effectively mitigating the multi-view cross-scale gap for holistic agricultural understanding. Diagnostic evaluations on AgMMU further reveal an inherent heterogeneity between macro-priors and micro-diagnostics through constrained zero-shot performance. Meanwhile, even minimal fine-tuning leads to a dramatic performance gain of AgroNVILA on AgMMU, strongly demonstrating its generalization capability empowered by AgroOmni. Full training scripts are publicly available at https://anonymous.4open.science/r/AgroOmni-6510.

Agricultural Dataset (ORG) Cross-Scale Multimodal Reasoning arXiv:2603.14342v2 (ORG) UAV (ORG) GPT-5.2 (LOCATION) AgMMU (ORG)

Originally published by arXiv CS Read original →

AgroOmni: A Large-Scale Multi-view Agricultural Dataset for Cross-Scale Multimodal Reasoning

Related Stories

Jeffery Lee breathes ‘sigh of relief’ after Alabama’s nitrogen execution deemed unconstitutional

Nearly Everyone, Everywhere, Veers Left When Walking

Popular UK seaside town hotel plunges into administration as holidaymakers updated

Scientists were excited about a blood test for many cancers — but it failed a big trial. Here's what to know.