Illusions of the Gold Standard: A Large-scale Analysis of Human Evaluation Protocols for Long-form Text Generation

arXiv CS, Tuesday 09 June 2026, 04:00 UTC. By Katelyn Xiaoying Mei, Yi-Li Hsu, Minjoon Choi, Zongwan Cao, Chenjun Xu, Bingbing Wen, Su Lin Blodgett, Lucy Lu Wang

Key Points

arXiv:2606.07936v1. Abstract: Human evaluation plays a critical role in assessing the quality of generated text. However, the reliability and reproducibility of these evaluations depend on transparent and well-documented protocols, and these details are frequently missing in current practice. In this work, we conduct a large-scale analysis of human evaluation protocols for long-form generation tasks in *CL conference publications from 2023 to 2025, including a full manual review of 284 papers and LLM-assisted analysis of another 1.8k+ papers. We define a set of 20 reportable criteria related to the reproducibility of human evaluation studies, and apply these criteria to systematically examine reporting norms and practices within the community. We find widespread under-reporting of important aspects of human evaluation study design, leading to ambiguity about what was measured and how, who contributed judgments, and how judgments should be interpreted. Based on these findings, we outline actionable recommendations to support more transparent and reproducible reporting in future research. Our analysis code and annotated dataset can be found at: https://github.com/larchlab/Illusions-of-the-Gold-Standard

Originally published by arXiv CS.
