Home › Knowledge Base › Bradley--Terry

Bradley--Terry

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

Bradley-Terry Rankings for Recommender Systems Across Dataset Taxonomies

arXiv:2606.07492v1 Announce Type: new Abstract: The ranking of recommendation algorithms is a challenging problem since model performance is sensitive to dataset characteristics such as sparsity, sequential structure, and scale. This drives a demand for a proper methodology for fair comparison between algorithms.

arXiv CS 2d ago

Pluralistic Leaderboards

arXiv:2606.02547v1 Announce Type: new Abstract: Recent leaderboard-based evaluations of large language models aggregate user feedback by fitting a Bradley--Terry model to pairwise comparisons, producing a single global ranking based on a latent quality score. While appealing for its simplicity, this approach is incompatible with heterogeneous preferences: when LLMs are used across diverse tasks and use cases, users who favor fundamentally different model behaviors can be systematically...

arXiv CS 8d ago

Semiparametric Preference Optimization: Your Language Model is Secretly a Single-Index Model

arXiv:2512.21917v3 Announce Type: replace Abstract: Policy alignment to preference data typically assumes a known link function between observed preferences and latent rewards (e.g., Bradley-Terry model / logistic link). Misspecification of this link can bias inferred rewards and misalign learned policies. We study policy alignment under an unknown and unrestricted link function.

arXiv CS 6d ago

Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling

arXiv:2602.10623v2 Announce Type: replace Abstract: Reward models learned from human preferences are central to aligning large language models (LLMs) via reinforcement learning from human feedback, yet they are often vulnerable to reward hacking due to noisy annotations and systematic biases such as response length or style. We propose Bayesian Non-Negative Reward Model (BNRM), a principled reward modeling framework that integrates non-negative factor analysis into Bradley-Terry (BT)...

arXiv CS 8d ago

Reward Learning from Best-of-$N$ Preference Data: Targets, Tradeoffs, and Design Principles

arXiv:2605.30619v1 Announce Type: cross Abstract: Best-of-$N$ sampling is widely used to construct pairwise preference data: $N$ candidates are drawn from a base distribution, and the best is paired with a rejected response. Despite its widespread use, what Bradley--Terry (BT) reward learning extracts from such data, and how to choose $N$ and the base distribution, remain unclear. We specialize a recent analysis of preference data via its induced conditional distribution to Best-of-$N$. For...

arXiv CS 9d ago

What Does Preference Learning Recover from Pairwise Comparison Data?

arXiv:2602.10286v2 Announce Type: replace Abstract: Pairwise preference learning is central to machine learning, with recent applications in aligning language models with human preferences. A typical dataset consists of triplets $(x, y^+, y^-)$, where response $y^+$ is preferred over response $y^-$ for context $x$. The Bradley--Terry (BT) model is the predominant approach, modeling preference probabilities as a function of latent score differences.

arXiv CS 9d ago

S-SPPO: Semantic-Calibrated Self-Play Preference Optimization

arXiv:2606.01561v1 Announce Type: new Abstract: Aligning Large Language Models (LLMs) with human preferences is often formulated via Direct Preference Optimization (DPO). However, the standard Bradley-Terry instantiation of DPO is limited in modeling common departures from transitivity in human preferences. To address this, recent work has introduced Self-Play Preference Optimization (SPPO), which iteratively refines the policy by training on self-generated win-lose pairs.

arXiv CS 8d ago

Which sparkling water is the best?

The Sparkling Water Report three minds and gullets looking for the winning bubbles With my friends Manuel and Aurélien, also friends of the fizz, we set out to find which sparkling water is the best one. We limited ourselves to ones that you could readily buy in Paris, up to the limit of what we could carry. This means 14 waters, blind tested: each water was poured in an opaque glass associated to a number, the glasses were then shuffled and turned facing opposite of the drinkers.

Hacker News 7d ago

Benchmarking at the Edge of Comprehension

arXiv:2602.14307v4 Announce Type: replace Abstract: As frontier Large Language Models (LLMs) increasingly saturate new benchmarks shortly after they are published, benchmarking itself is at a juncture: if frontier models keep improving, it will become increasingly hard for humans to generate discriminative tasks, provide accurate ground-truth answers, or evaluate complex solutions. If benchmarking becomes infeasible, our ability to measure any progress in AI is at stake. We refer to this...

arXiv CS 8d ago

Distribution-Calibrated Inference Time Compute for Thinking LLM-as-a-Judge

arXiv:2512.03019v2 Announce Type: replace Abstract: Thinking Large Language Models (LLMs) used as judges for pairwise preferences remain noisy at the single-sample level, and common aggregation rules (majority vote, soft self-consistency, or instruction-based self-aggregation) are inconsistent when ties are allowed. We study inference-time compute (ITC) for evaluators that generate n independent thinking--rating samples per item, and propose a principled, distribution-calibrated aggregation...

arXiv CS 7d ago