CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks

arXiv CS, Friday 05 June 2026, 04:00 UTC. By Alexander Apartsin, Yehudit Aperstein

arXiv:2606.03650v2

Abstract: Selecting a pretrained language model, or evaluating a fine-tuned one, for a specific application is a high-value decision, yet the public benchmarks used to make it are poorly suited: a generic benchmark need not reflect a particular sub-domain or sub-task, and its scores are suspect when its items have leaked into pretraining and are recalled rather than solved. We present CoEval, an open framework that supplies a trustworthy, task-specific signal through ensemble self-evaluation: from a task or domain description, a pool of models rotates through all three roles (teacher, student, and judge) to generate a fresh, contamination-free benchmark, answer it, and score one another, with no human labels or raters. Because every model also answers as a student, the responses themselves are the data used to weight each question by its discriminative power and each judge by its consensus with the panel. Where ground truth exists, CoEval recovers the true ranking and tracks objective correctness at ρ = 0.86, and the weighting recovers the gold ranking of thirteen models at Spearman 0.95. Reliability comes from panel composition, not size: this label-free weighting zeroes out broken judges and down-weights saturated questions, so neither distorts the ranking. Generated items show zero verbatim overlap with five public benchmarks, the panel cancels verbosity bias and precludes same-family self-preference, and rankings are domain-specific: three different models top four de-novo domains, so a generic leaderboard misdirects most practitioners. The same pipeline reruns on each model release, giving any team a contamination-free leaderboard for its application.
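The label-free weighting described in the abstract can be illustrated with a small sketch. The abstract does not give the formulas, so everything below is an assumption made for illustration: the function name rank_students, the scores[judge, student, question] array layout, the use of Pearson correlation against the leave-one-out panel mean as the "consensus" measure, and the use of the standard deviation across students as the "discriminative power" measure. It is not CoEval's actual implementation.

```python
import numpy as np

def rank_students(scores):
    """Rank students from judge scores alone (hypothetical sketch).

    scores[j, s, q] is the score judge j gave student s on question q.
    """
    scores = np.asarray(scores, dtype=float)
    n_judges = scores.shape[0]

    # Judge weight (assumed proxy): correlation between this judge's scores
    # and the mean of the other judges, clipped at zero so that a judge who
    # disagrees with the panel contributes nothing.
    judge_w = np.zeros(n_judges)
    for j in range(n_judges):
        mine = scores[j].ravel()
        others = np.delete(scores, j, axis=0).mean(axis=0).ravel()
        if mine.std() > 0 and others.std() > 0:
            judge_w[j] = max(0.0, np.corrcoef(mine, others)[0, 1])
    if judge_w.sum() == 0:
        judge_w[:] = 1.0  # fall back to equal weights
    judge_w /= judge_w.sum()

    # Weighted panel score per (student, question), shape (S, Q).
    consensus = np.tensordot(judge_w, scores, axes=1)

    # Question weight (assumed proxy): spread of scores across students.
    # A saturated question, where everyone scores alike, gets weight near zero.
    question_w = consensus.std(axis=0)
    if question_w.sum() == 0:
        question_w[:] = 1.0
    question_w /= question_w.sum()

    student_score = consensus @ question_w
    order = np.argsort(-student_score)  # best student first
    return order, student_score, judge_w, question_w

# Toy data: 3 students, 3 questions; the last question is saturated.
truth = np.array([[1.0, 0.8, 1.0],
                  [0.5, 0.6, 1.0],
                  [0.0, 0.2, 1.0]])
# Three consistent judges and one "broken" judge that inverts every score.
scores = np.stack([truth, truth, 0.9 * truth, 1.0 - truth])
order, student_score, judge_w, question_w = rank_students(scores)
print(order, judge_w.round(2), question_w.round(2))
```

In this toy panel the inverting judge correlates negatively with the others and is clipped to zero weight, and the saturated third question receives almost no weight, which mirrors the behaviour the abstract attributes to the weighting. A rank correlation such as Spearman's would be a natural alternative to Pearson here; the choice above is only for keeping the sketch dependency-free beyond NumPy.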

Originally published by arXiv CS