The Illusion of Generalization in Tabular Language Models

arXiv CS Monday 01 June 2026, 04:00 UTC By Aditya Gorla, Ratish Puduppully 1 min read

Key Points

arXiv:2602.04031v2 Announce Type: replace Abstract: Tabular Language Models (TLMs) have been claimed to achieve strong generalization for tabular prediction. We conduct a systematic re-evaluation of Tabula-8B as a representative TLM, utilizing 165 datasets from the UniPredict benchmark. Our investigation reveals three findings. First, binary and categorical classification achieve near-zero median lift over majority-class baselines and strong aggregate performance is driven entirely by quartile classification tasks. Second, top-performing datasets exhibit pervasive contamination, including complete train-test overlap and task-level leakage that evades standard deduplication. Third, instruction-tuning without tabular exposure recovers 92.2% of standard classification performance and on quartile classification, format familiarity closes 71.3% of the gap with the residual attributable to contaminated datasets. These findings suggest claimed generalization likely reflects evaluation artifacts rather than learned tabular reasoning. We conclude with recommendations for strengthening TLM evaluation.

The Illusion of Generalization in (ORG) TLM (ORG) UniPredict (ORG)

Originally published by arXiv CS Read original →

The Illusion of Generalization in Tabular Language Models

Related Stories

Musk Stock Fans Say ‘The More, The Better’ in SpaceX IPO Frenzy

Whale graveyard dating back five million years discovered

Whale graveyard dating back five million years discovered

SpaceX Leaves Some Banks Peeved at Junior Roles in IPO Lineup