Home Knowledge Base Pythia

Pythia

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

A Negative Result on Cross-Model Activation Transfer in a Pythia Multi-Hop Setting

arXiv:2606.03280v2 Announce Type: replace Abstract: Recent work shows that language models can transmit behavioural traits through hidden signals in generated data during training. We ask whether a different activation-mediated channel is viable: can one language model communicate a useful intermediate reasoning state to another at inference time through a post-hoc linear activation bridge, rather than through a textual or structured-token relay?

arXiv CS 2d ago

A Negative Result on Cross-Model Activation Transfer in a Pythia Multi-Hop Setting

Announce Type: new Abstract: Recent work shows that language models can transmit behavioural traits through hidden signals in generated data during training. We ask whether a more direct and stricter channel is also viable: can one language model communicate useful intermediate reasoning state to another at inference time by translating and injecting hidden activations, rather than by passing natural-language text?

arXiv CS 7d ago

When Do Attention Circuits Form? Developmental Trajectories of Capability and Attention-Sink Emergence Across Three 1B-ClassArchitectures

arXiv:2606.02378v1 Announce Type: new Abstract: We track the developmental trajectory of attention-head circuit formation across three 1B-class language models spanning two architecture families (dense transformer, mixture-of-experts) and two pretraining corpora (The Pile, DCLM): Pythia 1B, OLMo 1B-0724-hf, and OLMoE 1B-7B-0924. At each of 10 log-spaced revisions per model -- 30 mechanistic-interpretability runs in total -- we apply a participation-ratio (PR) spectral signal and an all-head...

arXiv CS 8d ago

Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units

Announce Type: replace Abstract: While Mechanistic Interpretability has identified interpretable circuits in LLMs, their causal origins in training data remain elusive. We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs Influence Functions to trace interpretable units back to specific training samples. Through extensive experiments on the Pythia family, we causally validate that targeted intervention--removing or augmenting a small fraction of high-influence...

arXiv CS 1d ago

Cross-Layer Subspace Coupling for LLM Compression: A Unifying Framework and Its Empirical Limits

arXiv:2605.30836v2 Announce Type: replace Abstract: Recent SVD based compression methods for large language models like SVD LLM and Basis Sharing can be unified under one optimization problem. While mathematical proofs and tests on Pythia models show this unified approach improves weight reconstruction error by up to 46% percent it fails in practical tasks.

arXiv CS 1d ago

Cross-Layer Subspace Coupling for LLM Compression: A Unifying Framework and Its Empirical Limits

Announce Type: new Abstract: Recent SVD based compression methods for large language models like SVD LLM and Basis Sharing can be unified under one optimization problem. While mathematical proofs and tests on Pythia models show this unified approach improves weight reconstruction error by up to 46% percent it fails in practical tasks.

arXiv CS 9d ago

When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception

arXiv:2605.30381v1 Announce Type: new Abstract: Deceptive alignment, in which models maintain accurate internal representations while deliberately producing false outputs, remains a central challenge in AI safety. While strategic deception is the primary long-term concern, synthetic dishonesty - induced via direct optimization on incorrect answers - provides a controlled testbed for studying the representational basis of learned deception.

arXiv CS 9d ago

Pre-Intervention Prediction of Sparse Autoencoder Steering Side Effects

Announce Type: new Abstract: Sparse autoencoder (SAE) features are increasingly used to steer language models, but feature steering is rarely clean: the same intervention can behave inconsistently across contexts and perturb unrelated features. We introduce a pre-intervention screening framework for forecasting SAE steering side effects from feature statistics computed before steering. We operationalize side effects along two axes of steering modularity, effect stability and collateral...

arXiv CS 1d ago

How Quantization Changes Interpretable Features: A Sparse Autoencoder Analysis of Language Models

new Abstract: Quantization is a standard path to deploying large language models, and a quantized model is typically judged acceptable when its perplexity or downstream accuracy stays close to the full-precision original. Whether the model still computes in the same way, or whether the interpretable features identified in the full-precision model survive weight rounding, is rarely tested, even as safety audits and steering interventions increasingly rely on those features. We ask whether...

arXiv CS 7d ago

Perplexity Can Miss SAE Feature Damage Under Quantization

Announce Type: replace Abstract: Quantization is a standard path to deploying large language models, and quantized models are typically judged acceptable when perplexity or downstream accuracy remains close to the full-precision original. But behavioral parity need not imply feature fidelity: the sparse-autoencoder (SAE) features used to interpret a full-precision model may change after weight rounding.

arXiv CS 2d ago