Mechanistic Interpretability

Related Articles from SNS

Mechanistic Interpretability as Statistical Estimation: A Variance Analysis

Mechanistic Interpretability (MI) aims to reverse-engineer model behaviors by identifying functional sub-networks. Yet, the scientific validity of these findings depends on their stability. In this work, we argue that circuit discovery is not a standalone task but a statistical estimation problem built upon causal mediation analysis (CMA).

arXiv CS 9d ago

Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability

arXiv:2606.06333v1. Sparse Autoencoders (SAEs) are widely used for mechanistic interpretability in large language models, yet their formulation assigns each latent feature a single decoder direction, implicitly assuming features to be one-dimensional. We show that this assumption mismatches with the multi-dimensional structure of model features, provably inducing feature splitting through two distinct mechanisms. Geometrically, reconstructing a feature of...

arXiv CS 5d ago

Discovering a Zeta Map Algorithm on Dyck Paths via Mechanistic Interpretability

Machine learning is increasingly used in mathematical discovery, but in mathematics the desired output is often not a prediction itself, but an explicit construction that can be checked independently. We study this setting through the zeta map on Dyck paths, a classical bijection in the combinatorics of the q,t-Catalan numbers. We train a deliberately small one-layer, one-head encoder-decoder transformer on this map and analyze its learned computation using...

arXiv CS 9d ago

Moral Sensitivity in LLMs: A Tiered Evaluation of Contextual Bias via Behavioral Profiling and Mechanistic Interpretability

arXiv:2605.03217v2. Large language models (LLMs) are increasingly deployed in settings that require nuanced ethical reasoning, yet existing bias evaluations treat model outputs as simply "biased" or "unbiased." This binary framing misses the gradual, context-sensitive way bias actually emerges. We address this gap in two stages: behavioral profiling and mechanistic validation.

arXiv CS 5d ago

Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units

While Mechanistic Interpretability has identified interpretable circuits in LLMs, their causal origins in training data remain elusive. We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs Influence Functions to trace interpretable units back to specific training samples. Through extensive experiments on the Pythia family, we causally validate that targeted intervention--removing or augmenting a small fraction of high-influence...

arXiv CS 1d ago

Latent Space Disentanglement via Activation Steering for Interpretable Attribute Control in Symbolic Music Generation

Transformer-based architectures have significantly advanced the generation of complex symbolic sequences, yet a significant gap remains in achieving fine-grained, interpretable control over discrete signal attributes. This paper investigates the mechanistic interpretability of the Multitrack Music Transformer (MMT) and proposes a framework for deterministic attribute modulation without retraining to bridge this gap via inference-time activation steering. Utilizing the...

arXiv CS 9d ago

Ablation-Reversible Heads Don't Transfer: A Stress Test for Mechanistic Role Claims in Transformers

In mechanistic interpretability, attention heads are commonly elevated to role claims (e.g., "this head represents addition") when they are necessary for a behavior, encode it linearly, and recover that behavior when restored after ablation. We show this evidence is insufficient: across three 7-8B instruction-tuned models and five computation families, heads passing all three checks routinely fail to transfer the computation when their activations are patched...

arXiv CS 1d ago

A Monosemantic Attribution Framework for Stable Interpretability in Clinical Neuroscience Transformer-Based Language Models

arXiv:2601.17952v2. Interpretability remains a key challenge for deploying language models (LM) in clinical settings such as progression diagnosis of Alzheimer disease, where early and trustworthy predictions are essential. Existing attribution methods exhibit high inter-method variability and unstable explanations due to the polysemantic nature of Transformer-Based LM and LLM representations, while mechanistic interpretability approaches lack direct alignment...

arXiv CS 8d ago

LLMs are not the black box you were promised

LLMs are not the "black box" you were promised. Mechanistic interpretability — peering into a neural network to reverse engineer its inner workings — has made major strides. Anthropic's On the Biology of a Large Language Model (2025) is a landmark in that effort.

Hacker News 7d ago

Trajectory Geometry of Transformer Representations Across Layers

arXiv:2606.09287v1. Understanding how transformer representations evolve across layers, not merely what they encode, remains an open problem in mechanistic interpretability. We recast the transformer forward pass as a discrete population trajectory through a high-dimensional representation manifold, drawing on geometric tools from computational neuroscience. Rather than probing for pre-specified features, we characterize trajectory geometry using five metrics...

arXiv CS 1d ago