Correct Looks Better: Pairwise Comparisons Reveal Accuracy Rankings

arXiv CS, Tuesday 09 June 2026, 04:00 UTC. By Mina Remeli and Moritz Hardt.

Key Points

arXiv:2606.09409v1. Abstract: Pairwise comparisons combined with aggregation methods like Elo have become central to evaluating generative models, yet concerns remain that they reward superficial stylistic cues or display judge biases. In a more positive turn, we show that model rankings from pairwise comparisons strongly agree with ground-truth-based accuracy rankings when such ground truth is available for comparison. By converting five well-known benchmarks into free-form generative evaluations, we find that Elo rankings achieve a Spearman correlation above 0.9 with accuracy rankings and substantially outperform direct evaluation when the judge is weak. Furthermore, style and judge bias have only minor effects on model rankings, despite most judgments occurring on pairs where both candidate answers are correct (or incorrect). On such pairs, we find that repetition after the final answer (echo) is a causal driver of judge preference.
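To make the comparison in the abstract concrete, the following is a minimal sketch of how an Elo ranking built from pairwise judgments could be compared against an accuracy ranking using Spearman correlation. It is not the authors' code. The function names, the K-factor of 16, the base rating of 1000, and the toy judge (which prefers the correct answer when exactly one answer is correct and flips a coin otherwise) are all assumptions made for illustration.

```python
import random


def elo_ratings(comparisons, k=16.0, base=1000.0):
    """Sequential Elo updates from an iterable of (winner, loser) pairs.

    k and base are illustrative defaults, not values from the paper.
    """
    ratings = {}
    for winner, loser in comparisons:
        rw = ratings.setdefault(winner, base)
        rl = ratings.setdefault(loser, base)
        # Probability the winner was expected to win under the Elo model.
        expected_w = 1.0 / (1.0 + 10 ** ((rl - rw) / 400.0))
        delta = k * (1.0 - expected_w)
        ratings[winner] = rw + delta
        ratings[loser] = rl - delta
    return ratings


def spearman(xs, ys):
    """Spearman rank correlation for two equal-length lists without ties."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, idx in enumerate(order):
            result[idx] = rank
        return result

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))


def simulate(accuracies, n_comparisons=2000, seed=0):
    """Toy simulation (an assumption, not the paper's setup).

    Each comparison draws two distinct models, samples whether each one
    answers correctly (independently, with its own accuracy), and lets a
    hypothetical judge pick the correct answer when exactly one is correct.
    When both are correct or both incorrect, the judge flips a fair coin.
    """
    rng = random.Random(seed)
    names = list(accuracies)
    comparisons = []
    for _ in range(n_comparisons):
        a, b = rng.sample(names, 2)
        a_ok = rng.random() < accuracies[a]
        b_ok = rng.random() < accuracies[b]
        if a_ok != b_ok:
            pair = (a, b) if a_ok else (b, a)
        else:
            pair = (a, b) if rng.random() < 0.5 else (b, a)
        comparisons.append(pair)
    return elo_ratings(comparisons)


if __name__ == "__main__":
    # Hypothetical model names and accuracies, chosen only for the demo.
    acc = {"model_a": 0.3, "model_b": 0.5, "model_c": 0.7, "model_d": 0.9}
    ratings = simulate(acc)
    names = list(acc)
    rho = spearman([acc[m] for m in names], [ratings[m] for m in names])
    print("Elo ratings:", ratings)
    print("Spearman correlation with accuracy:", rho)
```

In this sketch, tied pairs (both correct or both incorrect) carry no accuracy signal, so only the pairs where exactly one answer is correct push the Elo ranking toward the accuracy ranking. The `spearman` helper assumes no tied values; for real data with ties, an established routine such as `scipy.stats.spearmanr` would be the safer choice.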


Originally published by arXiv CS.
