the Mechanistic Interpretability Benchmark
No mentions found
This entity hasn't been tracked yet, or Iris is still building its knowledge base.
Related Articles from SNS
Correcting Gradient-Based Circuit Localization via Interaction-Aware Backpropagation
arXiv:2505.17630v4 Announce Type: replace Abstract: Circuit localization methods aim to identify the subset of model components responsible for specific behaviors in large language models, enabling detailed mechanistic analysis. Most existing methods assume components act independently and estimate importance by perturbing each component in isolation. However, components in neural networks interact, and ignoring these interactions leads to systematic misestimation of component importance.
Linear Probes Detect Task Format, Not Reasoning Mode in Language Model Hidden States
Announce Type: replace Abstract: Linear probing of large language model (LLM) hidden states is widely used to claim that models learn distinct representations for different reasoning types. We test this by probing Qwen3-14B on three benchmarks spanning the classical trichotomy: LogiQA 2.0 (deductive), ARC-Challenge (inductive), and $\alpha$NLI (abductive). At layer 32 of 40, linear probes achieve 100\% cross-validated accuracy with well-separated geometry (intrinsic dimensionalities: 20.6,...
Linear Probes Detect Task Format, Not Reasoning Mode in Language Model Hidden States
arXiv:2606.02907v1 Announce Type: new Abstract: Linear probing of large language model (LLM) hidden states is widely used to claim that models learn distinct representations for different reasoning types. We test this by probing Qwen3-14B on three benchmarks spanning the classical trichotomy: LogiQA 2.0 (deductive), ARC-Challenge (inductive), and $\alpha$NLI (abductive). At layer 32 of 40, linear probes achieve 100\% cross-validated accuracy with well-separated geometry (intrinsic...
TabCausal: Pretraining Across Causal Environments for Tabular Causal Discovery
arXiv:2605.31156v1 Announce Type: new Abstract: Causal discovery aims to recover directed causal relations from observational and interventional data, providing a basis for mechanistic understanding and reliable decision-making. Causal discovery foundation models (CDFMs) seek to amortize this problem by mapping a dataset directly to a causal graph in a single forward pass, avoiding per-dataset testing, search, or optimization. However, existing CDFMs remain limited, often failing to...
Human-Like Neural Nets by Catapulting
Human-like Neural Nets by Catapulting Speculative proposal to create artificial neural nets with human-like performance by high-learning-rate/regularization training of overparameterized NNs to trigger catapulting/grokking. Over-parameterization as a route to true generalization would resolve many outstanding mysteries of artificial versus natural intelligence. There are many mysteries about deep learning and human intelligence, but we could describe the biggest anomaly this way: why are...
SL-BiLEM: Structured Learnable Behavior-in-the-Loop Epidemic Modeling for Forecasting and Policy Evaluation
Announce Type: replace Abstract: Epidemic forecasting faces a fundamental challenge: human behavior dynamically responds to disease spread, creating feedback loops that induce distribution shifts at policy intervention points. This renders data-driven models unreliable under distribution shift. We propose \textbf{SL-BiLEM} (Structured Learnable Behavior-in-the-Loop Epidemic Model), leveraging physical constraints as regularization for robust extrapolation.
Latent Agents: A Post-Training Procedure for Internalized Multi-Agent Debate
Computer Science > Artificial Intelligence [Submitted on 27 Apr 2026] Title:Latent Agents: A Post-Training Procedure for Internalized Multi-Agent Debate View PDF HTML (experimental)Abstract:Multi-agent debate has been shown to improve reasoning in large language models (LLMs). However, it is compute-intensive, requiring generation of long transcripts before answering questions.
Sparse Autoencoders Reveal Interpretable and Steerable Features in VLA Models
Announce Type: replace Abstract: Vision-Language-Action (VLA) models have emerged as a promising approach for general-purpose robot manipulation. However, little research has mechanistically explored when and why they generalize across objects, scenes, and instructions. To probe internal representations, we train Sparse Autoencoders (SAEs) on the VLA's hidden-layer activations.
Matter to Mechanism: A Benchmark for AI Co-Scientists in Materials and Battery Research
arXiv:2606.02258v1 Announce Type: new Abstract: AI co-scientists are increasingly used for scientific discovery, but current evaluations still do not test them on a key task: moving from a concrete scientific or technological problem to a plausible, mechanism-grounded solution hypothesis. This gap is especially important in materials science and, in particular, battery research, where a useful proposal must identify the relevant failure mode, propose a credible intervention, and explain why...
Whole-genome duplication shaped cell-type evolution in the vertebrate brain
Abstract The complex brains of vertebrates have more cell types than those of their closest relatives. Whole-genome duplications (WGDs) occurred during early vertebrate evolution1, but it is unclear whether the duplicated genes (ohnologues) facilitated cell-type evolution. Here using brain single-cell transcriptomes from five chordates—human2, mouse3, lizard4, lamprey5 and amphioxus—we report that many cell-type families with conserved core transcription factors in vertebrates do not show...