Capacity, Not Format: Rethinking Structured Reasoning Failures

arXiv CS, Tuesday 09 June 2026, 04:00 UTC. By Hengxin Fan

arXiv:2606.09410v1. Abstract: Prior work treats structured output as a reasoning tax, but this framing is incomplete: the cost of formatting depends strongly on a model's spare capacity. Using information-matched prose controls and a four-level schema complexity gradient, we separate format-specific effects from prompt-length confounds across 4 models and 5 benchmarks with 0% parse failures on successfully generated responses. We find that structured formats are capacity-dependent. Models with sufficient headroom absorb JSON constraints without degradation (Sonnet: $88.7\pm4.0$% JSON vs. $89.3\pm1.7$% CoT on MATH-Hard). In contrast, formats severely degrade models operating near their limits through two distinct mechanisms. First, under standard token budgets, Haiku drops 36.2pp ($p < 0.0001$) largely due to truncation. Second, even with extended budgets eliminating truncation, GPT-4o-mini drops 28.0pp ($p < 0.001$), revealing pure capacity competition independent of token exhaustion. This format penalty scales with schema complexity (McNemar $p < 0.0001$) and cannot be explained by prompt length alone. Furthermore, these results qualify claims of frontier model immunity: on AIME competition math, Opus 4.7 drops from 96.2% to 91.0% under JSON ($-5.3$pp; the displayed percentages are independently rounded, and the exact difference is $7/133 = 5.26$pp $\approx 5.3$pp). A delayed-structure ablation (reasoning freely before formatting) recovers most of the lost accuracy (3-run mean: 80--87%), supporting the capacity competition mechanism. The practical implication is not to avoid structured output, but to match it to capacity: when a model is near its limits, think first, format later.
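
The rounding note on the AIME result can be checked with a few lines of arithmetic. The sketch below is a reader's check, not the authors' code. The denominator 133 comes from the abstract's "7/133"; the correct-answer counts 128 and 121 are assumptions, inferred because they are the only integers out of 133 that round to the reported 96.2% and 91.0%.

```python
from fractions import Fraction

# 133 is taken from the abstract's "7/133".
TOTAL = 133
# Assumed counts (not stated in the abstract): the only integers out of 133
# that round to the reported 96.2% and 91.0%.
free_correct = 128
json_correct = 121


def pct(count: int, total: int) -> Fraction:
    """Exact percentage as a rational number, to avoid float drift."""
    return 100 * Fraction(count, total)


free_pct = pct(free_correct, TOTAL)
json_pct = pct(json_correct, TOTAL)

# Exact drop in percentage points: 100 * 7 / 133.
exact_drop = free_pct - json_pct

# Drop computed from the two already-rounded percentages.
rounded_drop = round(round(float(free_pct), 1) - round(float(json_pct), 1), 1)

print(f"free-form accuracy: {float(free_pct):.1f}%")
print(f"JSON accuracy:      {float(json_pct):.1f}%")
print(f"exact drop:         {float(exact_drop):.2f}pp")
print(f"drop from rounded:  {rounded_drop:.1f}pp")
```

Under these assumed counts, subtracting the two displayed percentages gives 5.2pp, while the exact difference rounds to 5.3pp, which is why the abstract flags that the percentages are independently rounded.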

Originally published by arXiv CS.
