OckBench: Measuring the Efficiency of LLM Reasoning

arXiv CS, Thursday 04 June 2026, 04:00 UTC. By Zheng Du, Hao Kang, Song Han, Tushar Krishna, Ligeng Zhu

Key Points

arXiv:2511.05722v3. Abstract: Large language models (LLMs) such as GPT-5 and Gemini 3 have pushed the frontier of automated reasoning and code generation. Yet current benchmarks emphasize accuracy and output quality, neglecting a critical dimension: efficiency of token usage. Token efficiency is highly variable in practice. Models solving the same problem with similar accuracy can exhibit up to a 5.0× difference in token length, leading to a massive gap in model reasoning ability. Such variance exposes significant redundancy, highlighting the critical need for a standardized benchmark to quantify the token-efficiency gap. Thus, we introduce OckBench, the first benchmark that jointly measures accuracy and token efficiency across reasoning and coding tasks. Our evaluation reveals that token efficiency remains largely unoptimized across current models, significantly inflating serving costs and latency. These findings provide a concrete roadmap for the community to optimize a latent dimension of reasoning ability: token efficiency. Ultimately, we argue for an evaluation paradigm shift: tokens must not be multiplied beyond necessity. Our benchmarks are available at https://ockbench.github.io/.
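To make the idea of jointly reporting accuracy and token usage concrete, here is a minimal Python sketch. It is not OckBench's actual metric or code: the record layout, the function names `summarize` and `token_ratio`, and all numbers are assumptions made up for illustration. It only shows how two models with the same accuracy can differ several-fold in average output length.

```python
from statistics import mean


def summarize(results):
    """Return (accuracy, mean output tokens) for one model.

    `results` is a list of (correct, output_tokens) pairs, one per problem.
    This layout is a hypothetical choice for the sketch, not OckBench's format.
    """
    accuracy = mean(1.0 if correct else 0.0 for correct, _ in results)
    avg_tokens = mean(tokens for _, tokens in results)
    return accuracy, avg_tokens


def token_ratio(results_a, results_b):
    """How many times more output tokens model A uses than model B, on average."""
    _, tokens_a = summarize(results_a)
    _, tokens_b = summarize(results_b)
    return tokens_a / tokens_b


# Made-up numbers for illustration only; they are not results from the paper.
model_a = [(True, 5000), (True, 3000), (False, 4000)]
model_b = [(True, 1000), (True, 600), (False, 800)]

acc_a, tok_a = summarize(model_a)
acc_b, tok_b = summarize(model_b)
ratio = token_ratio(model_a, model_b)

print(f"model A: accuracy={acc_a:.2f}, mean tokens={tok_a}")
print(f"model B: accuracy={acc_b:.2f}, mean tokens={tok_b}")
print(f"token ratio A/B: {ratio:.1f}x")
```

In this toy data both models answer the same share of problems correctly, so an accuracy-only leaderboard would rank them equally, while the token ratio exposes the difference in serving cost.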


Originally published by arXiv CS.
