Pythia-70M

Related Articles from SNS

How Quantization Changes Interpretable Features: A Sparse Autoencoder Analysis of Language Models

Abstract: Quantization is a standard path to deploying large language models, and a quantized model is typically judged acceptable when its perplexity or downstream accuracy stays close to the full-precision original. Whether the model still computes in the same way, or whether the interpretable features identified in the full-precision model survive weight rounding, is rarely tested, even as safety audits and steering interventions increasingly rely on those features. We ask whether...

arXiv CS 7d ago

Pre-Intervention Prediction of Sparse Autoencoder Steering Side Effects

Abstract: Sparse autoencoder (SAE) features are increasingly used to steer language models, but feature steering is rarely clean: the same intervention can behave inconsistently across contexts and perturb unrelated features. We introduce a pre-intervention screening framework for forecasting SAE steering side effects from feature statistics computed before steering. We operationalize side effects along two axes of steering modularity, effect stability and collateral...

arXiv CS 1d ago

Perplexity Can Miss SAE Feature Damage Under Quantization

Abstract: Quantization is a standard path to deploying large language models, and quantized models are typically judged acceptable when perplexity or downstream accuracy remains close to the full-precision original. But behavioral parity need not imply feature fidelity: the sparse-autoencoder (SAE) features used to interpret a full-precision model may change after weight rounding.

arXiv CS 2d ago