Bradley-Terry

Related Articles from SNS

Bradley-Terry Rankings for Recommender Systems Across Dataset Taxonomies

arXiv:2606.07492v1 Announce Type: new Abstract: The ranking of recommendation algorithms is a challenging problem since model performance is sensitive to dataset characteristics such as sparsity, sequential structure, and scale. This drives a demand for a proper methodology for fair comparison between algorithms.

arXiv CS 2d ago

Semiparametric Preference Optimization: Your Language Model is Secretly a Single-Index Model

arXiv:2512.21917v3 Announce Type: replace Abstract: Policy alignment to preference data typically assumes a known link function between observed preferences and latent rewards (e.g., Bradley-Terry model / logistic link). Misspecification of this link can bias inferred rewards and misalign learned policies. We study policy alignment under an unknown and unrestricted link function.

arXiv CS 6d ago
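
The abstract above refers to the Bradley-Terry model as a logistic link between latent rewards and observed preferences. As a minimal sketch of that standard link (not the paper's method), the probability that item A is preferred to item B is the logistic function of the reward difference. The function name `bt_win_probability` and the example reward values are hypothetical, chosen only for illustration.

```python
import math


def bt_win_probability(reward_a: float, reward_b: float) -> float:
    """Probability that A is preferred to B under a Bradley-Terry (logistic) link.

    reward_a and reward_b are latent scores on a log scale (illustrative values,
    not taken from any of the listed papers).
    """
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


# Equal rewards give an even preference; a higher reward tilts it toward A.
p_equal = bt_win_probability(0.0, 0.0)
p_better = bt_win_probability(2.0, 0.0)
print(p_equal, p_better)
```

Only the difference between the two rewards matters here, so adding the same constant to both leaves the preference probability unchanged.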

Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling

arXiv:2602.10623v2 Announce Type: replace Abstract: Reward models learned from human preferences are central to aligning large language models (LLMs) via reinforcement learning from human feedback, yet they are often vulnerable to reward hacking due to noisy annotations and systematic biases such as response length or style. We propose Bayesian Non-Negative Reward Model (BNRM), a principled reward modeling framework that integrates non-negative factor analysis into Bradley-Terry (BT)...

arXiv CS 8d ago

S-SPPO: Semantic-Calibrated Self-Play Preference Optimization

arXiv:2606.01561v1 Announce Type: new Abstract: Aligning Large Language Models (LLMs) with human preferences is often formulated via Direct Preference Optimization (DPO). However, the standard Bradley-Terry instantiation of DPO is limited in modeling common departures from transitivity in human preferences. To address this, recent work has introduced Self-Play Preference Optimization (SPPO), which iteratively refines the policy by training on self-generated win-lose pairs.

arXiv CS 8d ago

Which sparkling water is the best?

The Sparkling Water Report: three minds and gullets looking for the winning bubbles. With my friends Manuel and Aurélien, also friends of the fizz, we set out to find which sparkling water is the best. We limited ourselves to waters that can readily be bought in Paris, up to the limit of what we could carry. This meant 14 waters, blind tested: each water was poured into an opaque glass associated with a number, and the glasses were then shuffled and turned to face away from the drinkers.

Hacker News 7d ago

Differentially Private Preference Data Synthesis for Large Language Model Alignment

arXiv:2605.30808v1 Announce Type: new Abstract: Preference alignment is a crucial post-training step for large language models (LLMs) to ensure their outputs align with human values. However, post-training on real human preference data raises privacy concerns, as these datasets often contain sensitive user prompts and human judgments. To address this, we propose DPPrefSyn, a novel algorithm for generating differentially private (DP) synthetic preference data to enable privacy-preserving...

arXiv CS 9d ago

DynaCF: Mitigating Shortcut Learning in Reward Models via Dynamic Counterfactual Sensitivity

arXiv:2606.09043v1 Announce Type: new Abstract: Reward models trained from pairwise preferences often exploit superficial shortcut cues rather than learning true response quality. We propose DynaCF, a dynamic reweighting framework for mitigating shortcut learning in reward model training. Unlike static shortcut heuristics, DynaCF measures shortcut sensitivity online during optimization by applying semantics-preserving counterfactual perturbations and tracking the resulting margin shifts and...

arXiv CS 1d ago

Rethinking Sales Lead Scoring with LLM-based Hierarchical Preference Ranking

arXiv:2606.04387v1 Announce Type: new Abstract: Sales lead conversion in high-stakes domains (e.g., automotive, real estate) differs fundamentally from e-commerce recommendation due to prolonged decision cycles and multi-stage funnels. Traditional lead scoring methods (rule-based scorecards, machine learning, or pointwise CTR models) face severe challenges: sparse supervision, a semantic gap in unstructured CRM logs, and inability to capture relative lead priority.

arXiv CS 6d ago

Capturing LLM Capabilities via Evidence-Calibrated Query Clustering

arXiv:2605.17110v2 Announce Type: replace Abstract: Query clustering organizes queries into groups that reflect shared latent capability demands, enabling capability-aware LLM evaluation. Existing clustering methods, which primarily rely on semantic taxonomies or embeddings, often fail to capture such latent capability requirements due to a misalignment between surface-level semantics and actual model performance. We propose ECC, an algorithm that calibrates prior semantic embeddings using...

arXiv CS 8d ago

HumorRank: A Tournament-Based Leaderboard for Evaluating Humor Generation in Large Language Models

arXiv:2604.19786v2 Announce Type: replace Abstract: Humor remains difficult to evaluate in large language models (LLMs) because what makes a response funny is subjective, comparative, and shaped by interacting comedic mechanisms rather than a single scalar property. Existing humor evaluation protocols therefore tend to produce isolated scores or task-specific judgments that are difficult to compare across models. We introduce HumorRank, a tournament-based framework for ranking textual humor...

arXiv CS 8d ago