leaderboard

Related Articles from SNS

Rank Intervals for Leaderboards: A Hierarchical Framework for Model Evaluation

arXiv:2606.08679v1 Announce Type: cross Abstract: Pretrained models are often evaluated on multi-task leaderboards to measure their applicability in diverse contexts. However, current methods for aggregating performance across tasks into leaderboard-level rankings do not address the uncertainty and variability at the task level. While recent works have proposed interval-based model rankings, the principled aggregation of uncertainty from individual tasks to leaderboard-level rankings remains...

arXiv CS 1d ago

Pluralistic Leaderboards

arXiv:2606.02547v1 Announce Type: new Abstract: Recent leaderboard-based evaluations of large language models aggregate user feedback by fitting a Bradley-Terry model to pairwise comparisons, producing a single global ranking based on a latent quality score. While appealing for its simplicity, this approach is incompatible with heterogeneous preferences: when LLMs are used across diverse tasks and use cases, users who favor fundamentally different model behaviors can be systematically...

arXiv CS 8d ago

HumorRank: A Tournament-Based Leaderboard for Evaluating Humor Generation in Large Language Models

arXiv:2604.19786v2 Announce Type: replace Abstract: Humor remains difficult to evaluate in large language models (LLMs) because what makes a response funny is subjective, comparative, and shaped by interacting comedic mechanisms rather than a single scalar property. Existing humor evaluation protocols therefore tend to produce isolated scores or task-specific judgments that are difficult to compare across models. We introduce HumorRank, a tournament-based framework for ranking textual humor...

arXiv CS 8d ago

HA-VLN 2.0: An Open Benchmark and Leaderboard for Human-Aware Navigation in Discrete and Continuous Environments with Dynamic Multi-Human Interactions

arXiv:2503.14229v4 Announce Type: replace Abstract: Vision-and-Language Navigation (VLN) has been studied mainly in either discrete or continuous spaces, with little attention to dynamic, crowded environments. We present HA-VLN 2.0, a unified benchmark introducing explicit social-awareness constraints. Our contributions are: (i) a standardized task and metrics capturing both goal accuracy and personal-space adherence; (ii) HAPS 2.0 dataset and simulators modeling multi-human interactions,...

arXiv CS 1d ago

Benchmarking AI for low-resource contexts: Thinking beyond leaderboards

arXiv:2605.28508v2 Announce Type: replace Abstract: Existing AI evaluation practices often fail to capture how systems actually perform in low-resource environments, where operational constraints shape usability as much as model quality. Through a structured analysis of existing benchmark families across speech, chat/RAG, and vision systems, we identify critical gaps between laboratory evaluation practices and real-world deployment conditions in low-resource environments. We argue that the...

arXiv CS 8d ago

Englishman Smith tops leaderboard at Charles Schwab Challenge

Highlights of day two from the Charles Schwab Challenge at Colonial Country Club in Fort Worth, Texas.

Sky Sports Football 11d ago

Nonparametric LLM Evaluation from Preference Data

arXiv:2601.21816v2 Announce Type: replace Abstract: Evaluating the performance of large language models (LLMs) from human preference data is crucial for obtaining LLM leaderboards. However, many existing approaches either rely on restrictive parametric assumptions or lack valid uncertainty quantification when flexible machine learning methods are used.

arXiv CS 1d ago

Latent Performance Profiling of Large Language Models

Announce Type: replace Abstract: Large language models (LLMs) frequently achieve impressive scores on standardized benchmarks, yet accuracy alone offers a limited view of their capabilities. Evaluating open-source LLMs through leaderboards faces persistent issues like data contamination, narrow task scope, and weak alignment with real-world reliability. Benchmark-based evaluations such as MMLU PRO, BBH, or IFEval primarily capture what a model outputs on fixed test sets, not how it processes...

arXiv CS 9d ago

A Chinese robotics start-up beat Nvidia on a global AI ranking. Is a new tech war brewing?

Spirit AI says its foundation model for embodied intelligence is the first from China to top the RoboArena global leaderboard. As artificial intelligence steps out of the digital realm and into the real world, the race to build the embodied “brains” powering next-generation robots has become the newest battleground in tech competition between China and the United States.

South China Morning Post 6d ago

Position: State-of-the-Art Claims Require State-of-the-Art Evidence

arXiv:2605.17273v3 Announce Type: replace Abstract: State-of-the-Art (SOTA) claims pervade Artificial Intelligence (AI) and Machine Learning (ML) research. These claims rest on benchmark evaluations, where models are ranked by aggregate scores across tasks. Public benchmarks or leaderboards are the most visible instance, but the same structure appears in paper tables throughout the literature.

arXiv CS 6d ago