Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation

arXiv CS, Tuesday 09 June 2026, 04:00 UTC. By Francesco Sovrano, Gabriele Dominici, Marc Langheinrich

Key Points

arXiv:2605.03058v2

Abstract: A central goal of explainable AI is to express large language model (LLM) decision logic symbolically and ground it in internal mechanisms. Existing rule-extraction methods usually learn ungrounded symbolic surrogates, while mechanistic interpretability links behavior to neurons but often requires hand-crafted hypotheses and costly interventions. We introduce MechaRule, a pipeline that grounds rule extraction in LLM circuits by localizing sparse agonist activations whose ablation disrupts rule-related behavior. MechaRule rests on two findings. First, in a fixed baseline/flip regime, sparse agonist effects can exhibit overtopping: a few high-effect activations remain detectable within larger groups, dominate weaker ones, and flip many of the same examples. In such regimes, adaptive group testing with confidence-guided conservative pruning requires O(k log(N/k) + k) interventions over N candidates when k << N are agonists. Second, agonists are localized more reliably on data splits aligned with close-to-faithful rule behavior; spectral splits provide a rule-free fallback, whereas unfaithful splits degrade localization. Empirically, on arithmetic and jailbreaking, MechaRule recalls 97.0% of highest-effect agonists in matched brute-force validations at only 2.14% of exhaustive-ablation cost on average. Ablating the localized agonists eliminates 97.6% to 100.0% of eligible correct arithmetic answers and jailbreaks, and can correct arithmetic errors or induce jailbreaks by up to 72.8% and 32.5%.
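To make the intervention-count claim concrete, the following is a minimal sketch of plain adaptive group testing by binary splitting. It is not MechaRule's implementation: the abstract's confidence-guided conservative pruning is omitted, and the names `find_agonists`, `group_has_effect`, and `toy_oracle` are hypothetical. The toy oracle assumes an idealized form of overtopping, in which a group tests positive exactly when it contains at least one agonist; in a real setting each oracle call would stand for ablating a group of activations and checking whether the behavior flips.

```python
from typing import Callable, List, Sequence, Tuple


def find_agonists(
    candidates: Sequence[int],
    group_has_effect: Callable[[Sequence[int]], bool],
) -> Tuple[List[int], int]:
    """Locate agonists by recursive halving; return (agonists, number of tests).

    `group_has_effect` is a hypothetical oracle standing in for one group
    ablation: True means ablating the whole group disrupts the behavior.
    """
    found: List[int] = []
    n_tests = 0
    stack: List[List[int]] = [list(candidates)]
    while stack:
        group = stack.pop()
        if not group:
            continue
        n_tests += 1  # one intervention per tested group
        if not group_has_effect(group):
            continue  # a single negative test prunes the whole group
        if len(group) == 1:
            found.append(group[0])  # a positive singleton is an agonist
            continue
        mid = len(group) // 2
        stack.append(group[mid:])
        stack.append(group[:mid])
    return sorted(found), n_tests


# Toy setup (assumed, not from the paper): N = 1024 candidates, k = 3 agonists.
TRUE_AGONISTS = {3, 500, 777}


def toy_oracle(group: Sequence[int]) -> bool:
    # Idealized overtopping: any agonist in the group makes the group detectable.
    return any(unit in TRUE_AGONISTS for unit in group)


found, n_tests = find_agonists(range(1024), toy_oracle)
print(found, n_tests)
```

In this toy run, at most two groups are tested per agonist at each of the log2(N) = 10 halving levels, so the number of tests stays within 1 + 2 * 3 * 10 = 61, against 1024 for exhaustive single-unit ablation. That is the same qualitative scaling as the O(k log(N/k) + k) bound quoted above, though the constants and the noise handling of the actual method are not modeled here.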


Originally published by arXiv CS.

