Bradley-Terry Rankings

Related Articles from SNS

Bradley-Terry Rankings for Recommender Systems Across Dataset Taxonomies

arXiv:2606.07492v1 Announce Type: new Abstract: The ranking of recommendation algorithms is a challenging problem since model performance is sensitive to dataset characteristics such as sparsity, sequential structure, and scale. This drives a demand for a proper methodology for fair comparison between algorithms.

arXiv CS 2d ago

Which sparkling water is the best?

The Sparkling Water Report: three minds and gullets looking for the winning bubbles. With my friends Manuel and Aurélien, also friends of the fizz, we set out to find which sparkling water is the best. We limited ourselves to ones that you can readily buy in Paris, up to the limit of what we could carry. This means 14 waters, blind tested: each water was poured into an opaque glass associated with a number; the glasses were then shuffled and turned to face away from the drinkers.

Hacker News 7d ago

Nonparametric LLM Evaluation from Preference Data

arXiv:2601.21816v2 Announce Type: replace Abstract: Evaluating the performance of large language models (LLMs) from human preference data is crucial for obtaining LLM leaderboards. However, many existing approaches either rely on restrictive parametric assumptions or lack valid uncertainty quantification when flexible machine learning methods are used.

arXiv CS 1d ago

HumorRank: A Tournament-Based Leaderboard for Evaluating Humor Generation in Large Language Models

arXiv:2604.19786v2 Announce Type: replace Abstract: Humor remains difficult to evaluate in large language models (LLMs) because what makes a response funny is subjective, comparative, and shaped by interacting comedic mechanisms rather than a single scalar property. Existing humor evaluation protocols therefore tend to produce isolated scores or task-specific judgments that are difficult to compare across models. We introduce HumorRank, a tournament-based framework for ranking textual humor...

arXiv CS 8d ago

Benchmarking at the Edge of Comprehension

arXiv:2602.14307v4 Announce Type: replace Abstract: As frontier Large Language Models (LLMs) increasingly saturate new benchmarks shortly after they are published, benchmarking itself is at a juncture: if frontier models keep improving, it will become increasingly hard for humans to generate discriminative tasks, provide accurate ground-truth answers, or evaluate complex solutions. If benchmarking becomes infeasible, our ability to measure any progress in AI is at stake. We refer to this...

arXiv CS 8d ago

Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short

arXiv:2606.09380v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a leading paradigm for improving the reasoning ability of large language models through outcome-based supervision. However, verifiable rewards frequently become uninformative at the group level: when all sampled traces of a given prompt receive identical rewards, group-relative advantage estimation provides no gradient signal, even though the traces may differ substantially in...

arXiv CS 1d ago

Rethinking Sales Lead Scoring with LLM-based Hierarchical Preference Ranking

arXiv:2606.04387v1 Announce Type: new Abstract: Sales lead conversion in high-stakes domains (e.g., automotive, real estate) differs fundamentally from e-commerce recommendation due to prolonged decision cycles and multi-stage funnels. Traditional lead scoring methods (rule-based scorecards, machine learning, or pointwise CTR models) face severe challenges: sparse supervision, a semantic gap in unstructured CRM logs, and an inability to capture relative lead priority.

arXiv CS 6d ago

Semantic Retrieval for Product Search in E-Commerce

arXiv:2606.01504v1 Announce Type: new Abstract: Semantic retrieval in e-commerce must handle short, noisy, and colloquial queries over large product catalogs with fine-grained attribute distinctions. We present a Siamese LLM dual-encoder trained through a two-stage pipeline: contrastive learning with a false-negative margin mask to prevent penalization of near-duplicate products, followed by Relative Odds Alignment for Retrieval (ROAR), a preference optimization objective that extends...

arXiv CS 8d ago

Capturing LLM Capabilities via Evidence-Calibrated Query Clustering

arXiv:2605.17110v2 Announce Type: replace Abstract: Query clustering organizes queries into groups that reflect shared latent capability demands, enabling capability-aware LLM evaluation. Existing clustering methods, which primarily rely on semantic taxonomies or embeddings, often fail to capture such latent capability requirements due to a misalignment between surface-level semantics and actual model performance. We propose ECC, an algorithm that calibrates prior semantic embeddings using...

arXiv CS 8d ago