WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark

arXiv CS, Monday 08 June 2026, 04:00 UTC. By Yida Yin, Harish Krishnakumar, Chung Peng Lee, Boya Zeng, Wenhao Chai, Shengbang Tong, Wenhu Chen, Hu Xu, Xingyu Fu, Gabriel Sarch, Aleksandra Korolova, Zhuang Liu

Key Points

arXiv:2606.06538v1. Abstract: In real-world applications, models are expected to perform reliably across diverse settings. Yet many existing multimodal benchmarks expand task types without capturing the visual diversity needed to handle open-ended visual inputs. We present WorldBench, a challenging and visually diverse reasoning benchmark to evaluate Multimodal Large Language Models (MLLMs). We build a taxonomy of thousands of visual concepts across multiple domains (e.g., living things). Guided by this taxonomy, we curate a broad collection of images from search engines and existing datasets to comprehensively represent the visual world. Through structured trial and error, we manually design challenging questions that frontier MLLMs fail to answer. On quantitative and human evaluations, WorldBench achieves higher visual diversity than any existing diverse benchmark. Evaluating 15 MLLMs on WorldBench reveals weaknesses in visual understanding: even the strongest model reaches only 64.0% accuracy, while some models perform only marginally above chance level. We hope our work highlights the importance of visual diversity in building multimodal benchmarks.


Originally published by arXiv CS.

