Home Knowledge Base PolyPythias 410M

PolyPythias 410M

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

Welfare, Improvability, and Variance: A Principal-Agent Approach to Optimal Benchmark Item Aggregation

arXiv:2605.30916v1 Announce Type: new Abstract: AI benchmarks have well-documented limitations, with prior work examining contamination, saturation, and construct underspecification. Aggregation has received far less attention: benchmarks are typically summarized by uniformly averaging item-level scores, implicitly treating every test item as equally valuable. We model benchmarking as a multitask principal-agent game and show that the welfare loss from a benchmark is determined jointly by...

arXiv CS 9d ago