Sparse Autoencoders
No mentions found
This entity hasn't been tracked yet, or Iris is still building its knowledge base.
Related Articles from SNS
Toward Identifiable Sparse Autoencoders
Announce Type: new Abstract: Recently, sparse autoencoders (SAEs) have emerged as an attractive tool for interpreting and interacting with representations in practical neural networks. While it is common empirical folklore, we also show theoretically that SAEs are highly unstable: different training runs are likely to produce different concept dictionaries and sparse codes. We characterize the model properties that hinder the stability of real-world SAEs, and address each of these problems...
Aligned Training: A Parameter-Free Method to Improve Feature Quality and Stability of Sparse Autoencoders (SAE)
Announce Type: replace Abstract: Sparse autoencoders (SAEs) are one of the main methods to interpret the inner workings of deep neural networks (DNNs), decomposing activations into higher-dimensional features. However, they exhibit critical shortcomings where a large fraction of features are never activated and are unstable. Despite variants of SAEs that attempt to mitigate these issues, they require additional data, resampling, or training.
Pre-Intervention Prediction of Sparse Autoencoder Steering Side Effects
Announce Type: new Abstract: Sparse autoencoder (SAE) features are increasingly used to steer language models, but feature steering is rarely clean: the same intervention can behave inconsistently across contexts and perturb unrelated features. We introduce a pre-intervention screening framework for forecasting SAE steering side effects from feature statistics computed before steering. We operationalize side effects along two axes of steering modularity, effect stability and collateral...
On the Relationship Between Activation Outliers and Feature Death in Sparse Autoencoders
arXiv:2605.31518v1 Announce Type: new Abstract: Sparse autoencoders (SAEs) decompose neural network activations into interpretable features, but many learned features never activate, a problem called feature death that wastes dictionary capacity and can reintroduce superposition. Death rates vary dramatically between models: near-zero on GPT-2, over 70% on AlphaFold3 with identical configurations. We find that dimension-level activation outliers (dimensions whose mean magnitude is large...
AdaptiveK: Complexity-Driven Sparse Autoencoders for Interpretable Language Model Representations
arXiv:2508.17320v3 Announce Type: replace Abstract: Understanding the internal representations of large language models (LLMs) remains a central challenge for interpretability research. Sparse autoencoders (SAEs) offer a promising solution by decomposing activations into interpretable features, but existing approaches rely on fixed sparsity constraints that fail to account for input complexity. We propose AdaptiveK SAE (Adaptive Top K Sparse Autoencoders), a novel framework that dynamically...
Concept-SAE: A Controllable and Invertible Concept Interface for Sparse Autoencoders
arXiv:2509.22015v2 Announce Type: replace Abstract: Standard Sparse Autoencoders (SAEs) excel at discovering a dictionary of a model's learned features, providing a powerful lens for passive feature discovery. However, this passive nature makes it difficult to systematically evaluate or analyze concepts that users explicitly care about. We introduce Concept-SAE, a framework that augments SAEs with a structured and controllable interface for probing user-defined concepts.
Steering LLMs? Actually, Sparse Autoencoders can outperform simple baselines
arXiv:2605.31183v1 Announce Type: new Abstract: Sparse Autoencoders (SAEs) have been seen as a promising avenue for exploring the internals of Large Language Models (LLMs) and for steering model output generation. When AxBench - a model steering benchmark - was introduced in Wu et al. (2025), SAEs did not seem to live up to their original hype due to poor steering performance relative to a set of simple baselines.
Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability
arXiv:2606.06333v1 Announce Type: new Abstract: Sparse Autoencoders (SAEs) are widely used for mechanistic interpretability in large language models, yet their formulation assigns each latent feature a single decoder direction, implicitly assuming features to be one-dimensional. We show that this assumption mismatches with the multi-dimensional structure of model features, provably inducing feature splitting through two distinct mechanisms. Geometrically, reconstructing a feature of...
A Geometric View for Understanding Concept Learning and Neuron Interpretation in Sparse Autoencoders
Announce Type: new Abstract: We propose a unified mathematical framework for a geometric understanding of concept learning and neuron interpretation in sparse autoencoders (SAEs). While SAEs improve interpretability of neural networks by learning sparse feature representations, a principled definition of ''concept'' and ''learning'' remains unclear. We formalize concepts as sets of data points and cast concept learning as a set-alignment problem between human-defined and model-induced concepts.
Step-Level Sparse Autoencoder for Reasoning Process Interpretation
arXiv:2603.03031v2 Announce Type: replace Abstract: Large Language Models (LLMs) have achieved strong complex reasoning capabilities through Chain-of-Thought (CoT) reasoning. However, their reasoning patterns remain too complicated to analyze. While Sparse Autoencoders (SAEs) have emerged as a powerful tool for interpretability, existing approaches predominantly operate at the token level, creating a granularity mismatch when capturing more critical step-level information, such as reasoning...