LLM as a Meta-Judge: Synthetic Data for NLP Evaluation Metric Validation

arXiv CS Monday 08 June 2026, 04:00 UTC By Lukáš Eigler, Jindřich Libovický, David Hurych

Key Points

arXiv:2603.09403v3. Abstract: Validating evaluation metrics for NLG typically relies on expensive and time-consuming human annotations, which exist predominantly for English datasets. We propose LLM as a Meta-Judge, a scalable framework that uses LLMs to generate synthetic evaluation datasets via controlled semantic degradation of real data, replacing human judgment. We validate our approach using meta-correlation, which measures the alignment between metric rankings derived from synthetic data and those derived from standard human benchmarks. Experiments across Machine Translation, Question Answering, and Summarization demonstrate that synthetic validation serves as a reliable proxy for human judgment, achieving meta-correlations exceeding 0.9 in multilingual QA, and that it is a viable alternative where human judgments are unavailable or too expensive to obtain. Our code and data are publicly available at https://github.com/eiglerl/meta-judge.
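The meta-correlation idea in the abstract can be illustrated with a small sketch. The abstract does not say which correlation coefficient the authors use, so the Spearman rank correlation below is an assumption, as are the function names, the metric names, and the numbers, which are made up purely for illustration and are not results from the paper. Each input maps a metric to a single score describing how well that metric agrees with a reference signal: the synthetic degradation data in one case, human annotations in the other.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def meta_correlation(synthetic_scores, human_scores):
    """Compare how two validation sources rank the same set of metrics.

    Both arguments map a metric name to that metric's agreement score
    under one validation source. Only metrics present in both are used.
    """
    names = sorted(synthetic_scores.keys() & human_scores.keys())
    return spearman(
        [synthetic_scores[n] for n in names],
        [human_scores[n] for n in names],
    )


# Hypothetical per-metric agreement scores (illustrative values only).
synthetic = {"metric_a": 0.8, "metric_b": 0.5, "metric_c": 0.3}
human = {"metric_a": 0.7, "metric_b": 0.6, "metric_c": 0.2}

print(meta_correlation(synthetic, human))
```

In this toy case both sources order the three metrics identically, so the rank correlation is at its maximum. A value near 1 on real data would indicate that the synthetic benchmark ranks metrics much as the human benchmark does, which is the property the abstract reports for multilingual QA.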


Originally published by arXiv CS.
