ChinaHeritaQA: A Culturally-Grounded Visual Question Answering Dataset for World Heritage Sites in China

arXiv CS, Tuesday 09 June 2026, 04:00 UTC. By Yi Zhang, Bolei Ma, Yong Cao, Chengyan Wu, Daniel Hershcovich, Anna-Carolina Haensch.

Key Points

arXiv:2606.08959v1. Abstract: We introduce ChinaHeritaQA, a multimodal benchmark dataset for evaluating the cultural reasoning abilities of vision-language models (VLMs) on UNESCO World Heritage sites in China. The dataset comprises 2,279 in-the-wild images paired with 14,133 bilingual (Chinese/English) multiple-choice QA pairs spanning seven cognitive dimensions, from basic identity recognition to historical periodization and architectural analysis. Guided by a UNESCO-aligned heritage ontology and verified through rigorous human annotation, the dataset ensures linguistic quality and factual consistency. Evaluations of state-of-the-art VLMs reveal that while top models exceed human performance on average, substantial task-level variation emerges: models excel at visual recognition but struggle with culturally grounded reasoning. Performance also varies by dynasty and region. ChinaHeritaQA reveals that strong visual retrieval does not extend to cultural and historical understanding. We release the dataset to support future research on culturally aware multimodal learning.
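The task-level variation described in the abstract is the kind of result obtained by scoring multiple-choice predictions separately for each cognitive dimension instead of reporting one average. The following is a minimal sketch of that bookkeeping, not the authors' evaluation code: the record fields (`dimension`, `answer`, `prediction`), the dimension labels, and the toy records are all hypothetical and are not taken from the released dataset.

```python
from collections import defaultdict


def accuracy_by_dimension(records):
    """Return {dimension: accuracy} for multiple-choice predictions.

    Each record is a dict with hypothetical keys:
    'dimension', 'answer' (gold option), 'prediction' (model option).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["dimension"]] += 1
        if r["prediction"] == r["answer"]:
            correct[r["dimension"]] += 1
    return {d: correct[d] / total[d] for d in total}


# Toy records for illustration only; labels and answers are made up.
records = [
    {"dimension": "identity_recognition", "answer": "A", "prediction": "A"},
    {"dimension": "identity_recognition", "answer": "C", "prediction": "C"},
    {"dimension": "historical_periodization", "answer": "B", "prediction": "D"},
    {"dimension": "historical_periodization", "answer": "A", "prediction": "A"},
]

scores = accuracy_by_dimension(records)
overall = sum(r["prediction"] == r["answer"] for r in records) / len(records)

for dimension, acc in sorted(scores.items()):
    print(f"{dimension}: {acc:.2f}")
print(f"overall: {overall:.2f}")
```

In this toy case the overall accuracy hides the gap between the two dimensions, which is why a per-dimension breakdown is worth reporting alongside the average. The same grouping could be keyed on dynasty or region if those attributes are available per question.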


Originally published by arXiv CS.

