Home Knowledge Base PromptEval

PromptEval

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

Rank Intervals for Leaderboards: A Hierarchical Framework for Model Evaluation

arXiv:2606.08679v1 Announce Type: cross Abstract: Pretrained models are often evaluated on multi-task leaderboards to measure their applicability in diverse contexts. However, current methods for aggregating performance across tasks into leaderboard-level rankings do not address the uncertainty and variability at the task level. While recent works have proposed interval-based model rankings, the principled aggregation of uncertainty from individual tasks to leaderboard-level rankings remains...

arXiv CS 1d ago