Home › Knowledge Base › the Mechanistic Interpretability Benchmark

the Mechanistic Interpretability Benchmark

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

Correcting Gradient-Based Circuit Localization via Interaction-Aware Backpropagation

arXiv:2505.17630v4 Announce Type: replace Abstract: Circuit localization methods aim to identify the subset of model components responsible for specific behaviors in large language models, enabling detailed mechanistic analysis. Most existing methods assume components act independently and estimate importance by perturbing each component in isolation. However, components in neural networks interact, and ignoring these interactions leads to systematic misestimation of component importance.

arXiv CS 8d ago

Linear Probes Detect Task Format, Not Reasoning Mode in Language Model Hidden States

Announce Type: replace Abstract: Linear probing of large language model (LLM) hidden states is widely used to claim that models learn distinct representations for different reasoning types. We test this by probing Qwen3-14B on three benchmarks spanning the classical trichotomy: LogiQA 2.0 (deductive), ARC-Challenge (inductive), and $\alpha$NLI (abductive). At layer 32 of 40, linear probes achieve 100\% cross-validated accuracy with well-separated geometry (intrinsic dimensionalities: 20.6,...

arXiv CS 5d ago

Linear Probes Detect Task Format, Not Reasoning Mode in Language Model Hidden States

arXiv:2606.02907v1 Announce Type: new Abstract: Linear probing of large language model (LLM) hidden states is widely used to claim that models learn distinct representations for different reasoning types. We test this by probing Qwen3-14B on three benchmarks spanning the classical trichotomy: LogiQA 2.0 (deductive), ARC-Challenge (inductive), and $\alpha$NLI (abductive). At layer 32 of 40, linear probes achieve 100\% cross-validated accuracy with well-separated geometry (intrinsic...

arXiv CS 7d ago

TabCausal: Pretraining Across Causal Environments for Tabular Causal Discovery

arXiv:2605.31156v1 Announce Type: new Abstract: Causal discovery aims to recover directed causal relations from observational and interventional data, providing a basis for mechanistic understanding and reliable decision-making. Causal discovery foundation models (CDFMs) seek to amortize this problem by mapping a dataset directly to a causal graph in a single forward pass, avoiding per-dataset testing, search, or optimization. However, existing CDFMs remain limited, often failing to...

arXiv CS 9d ago

Human-Like Neural Nets by Catapulting

Human-like Neural Nets by Catapulting Speculative proposal to create artificial neural nets with human-like performance by high-learning-rate/regularization training of overparameterized NNs to trigger catapulting/grokking. Over-parameterization as a route to true generalization would resolve many outstanding mysteries of artificial versus natural intelligence. There are many mysteries about deep learning and human intelligence, but we could describe the biggest anomaly this way: why are...

Hacker News 3d ago

SL-BiLEM: Structured Learnable Behavior-in-the-Loop Epidemic Modeling for Forecasting and Policy Evaluation

Announce Type: replace Abstract: Epidemic forecasting faces a fundamental challenge: human behavior dynamically responds to disease spread, creating feedback loops that induce distribution shifts at policy intervention points. This renders data-driven models unreliable under distribution shift. We propose \textbf{SL-BiLEM} (Structured Learnable Behavior-in-the-Loop Epidemic Model), leveraging physical constraints as regularization for robust extrapolation.

arXiv CS 7d ago

Latent Agents: A Post-Training Procedure for Internalized Multi-Agent Debate

Computer Science > Artificial Intelligence [Submitted on 27 Apr 2026] Title:Latent Agents: A Post-Training Procedure for Internalized Multi-Agent Debate View PDF HTML (experimental)Abstract:Multi-agent debate has been shown to improve reasoning in large language models (LLMs). However, it is compute-intensive, requiring generation of long transcripts before answering questions.

Hacker News 5d ago

Sparse Autoencoders Reveal Interpretable and Steerable Features in VLA Models

Announce Type: replace Abstract: Vision-Language-Action (VLA) models have emerged as a promising approach for general-purpose robot manipulation. However, little research has mechanistically explored when and why they generalize across objects, scenes, and instructions. To probe internal representations, we train Sparse Autoencoders (SAEs) on the VLA's hidden-layer activations.

arXiv CS 1d ago

Matter to Mechanism: A Benchmark for AI Co-Scientists in Materials and Battery Research

arXiv:2606.02258v1 Announce Type: new Abstract: AI co-scientists are increasingly used for scientific discovery, but current evaluations still do not test them on a key task: moving from a concrete scientific or technological problem to a plausible, mechanism-grounded solution hypothesis. This gap is especially important in materials science and, in particular, battery research, where a useful proposal must identify the relevant failure mode, propose a credible intervention, and explain why...

arXiv CS 8d ago

Whole-genome duplication shaped cell-type evolution in the vertebrate brain

Abstract The complex brains of vertebrates have more cell types than those of their closest relatives. Whole-genome duplications (WGDs) occurred during early vertebrate evolution1, but it is unclear whether the duplicated genes (ohnologues) facilitated cell-type evolution. Here using brain single-cell transcriptomes from five chordates—human2, mouse3, lizard4, lamprey5 and amphioxus—we report that many cell-type families with conserved core transcription factors in vertebrates do not show...

Nature 19h ago