How Optimality Structures Sparse Dictionaries: A Theory for Understanding SAE Representations

arXiv CS Tuesday 02 June 2026, 04:00 UTC By William Dorrell 1 min read

Key Points

Announce Type: cross Abstract: Sparse Autoencoders (SAEs) have found success parsing neural representations into interpretable concepts, providing a basis for understanding and control. However, what exactly SAEs extract, and, correspondingly, the scientific conclusions we can draw from them, are not obvious. Empirically, the proof is in the pudding: SAEs learn interpretable features.

arXiv:2606.02385v1 Announce Type: cross Abstract: Sparse Autoencoders (SAEs) have found success parsing neural representations into interpretable concepts, providing a basis for understanding and control. However, what exactly SAEs extract, and, correspondingly, the scientific conclusions we can draw from them, are not obvious. Empirically, the proof is in the pudding: SAEs learn interpretable features. Theoretically, we lack a clear account of what properties a 'concept' must satisfy for an SAE to extract it. There has been extensive identifiability work studying the conditions under which sparse coding recovers ground-truth features; however, these approaches tends to focus on simple data-generating models (e.g. sparse independent features) which poorly approximate the internet-swallowing language-model representations on which SAEs are trained. Here, avoiding data-generating models, we ask simply what properties any dictionary learning optimum must satisfy. Concretely, we extend local optimality analyses (Gribonval & Schnass, 2010) to the nonnegative joint-optimisation problem that vanilla SAEs approximate, and derive constraints relating optimal SAE features to their distributions. We use these constraints to explain a range of observed SAE behaviours - hierarchical splitting & absorption, the structure of residuals, and dense antipodal features - each reflecting how L1+nonnegativity interact with data to structure optimal dictionaries. Finally, we construct a novel large-dictionary convex problem and explore the wide atom-per-datapoint limit. In sum, we hope to tease model assumptions from unexpected observations, letting us learn more from SAEs' successes and provide principles for designing their successors.

SAE (ORG)

Originally published by arXiv CS Read original →

How Optimality Structures Sparse Dictionaries: A Theory for Understanding SAE Representations

Related Stories

Google will save your Lens photos, Search Live recordings, and Translate audio for AI training

ASML to Cut Fewer Jobs Than Planned After Union Negotiations

Engadget Podcast: WWDC 2026 thoughts from Apple Park

German court holds Google liable for false AI Overview answers