Steer
No mentions found
This entity hasn't been tracked yet, or Iris is still building its knowledge base.
Related Articles from SNS
Riemannian-Manifold Steering: Geometry-Aware Generative Autoencoders for Label-Free Steering
arXiv:2605.24942v2 Announce Type: replace Abstract: Steering a language model - intervening on its internal activations to change downstream behaviour - has recently expanded beyond linear interpolation to nonlinear methods such as angular and kernelized steering, which define intervention transformations without learning an explicit geometry over paths in activation space. Freshly introduced geometry-aware manifold methods do learn such a geometry, but require labelled class centroids...
Evi-Steer: Learning to Steer Biomedical Vision-Language Models through Efficient and Generalizable Evidential Tuning
Announce Type: replace Abstract: Parameter-efficient adaptation of vision-language foundation models is crucial for precise multimodal understanding of biomedical images, yet existing methods remain deterministic and often struggle under domain shift or ambiguous image-text alignment. This limitation is particularly critical in the clinic, where models should remain robust in low-data regimes and domain shifts.
Pre-Intervention Prediction of Sparse Autoencoder Steering Side Effects
Announce Type: new Abstract: Sparse autoencoder (SAE) features are increasingly used to steer language models, but feature steering is rarely clean: the same intervention can behave inconsistently across contexts and perturb unrelated features. We introduce a pre-intervention screening framework for forecasting SAE steering side effects from feature statistics computed before steering. We operationalize side effects along two axes of steering modularity, effect stability and collateral...
Endogenous Resistance to Activation Steering in Language Models
arXiv:2602.06941v2 Announce Type: replace Abstract: Large language models can recover mid-generation from task-misaligned activation steering, producing explicit verbal restarts (e.g., ``wait, that's not right'') and continuing on-topic even while the steering perturbation remains active. We term this Endogenous Steering Resistance (ESR). Using sparse autoencoder (SAE) latents to steer model activations, we find that Llama-3.3-70B exhibits explicit ESR at \llamaseventyEsrRate\%, with smaller...
Activation Steering Induces Emergent Misalignment: A More Comprehensive Evaluation
arXiv:2606.08682v1 Announce Type: new Abstract: Activation steering has emerged as a popular inference-time technique for modulating the behavior of large language models (LLMs). By constructing a steering vector from examples of a target behavior and injecting it into intermediate activations during inference, activation steering enables flexible behavioral control while avoiding the permanent parameter updates required by finetuning. Meanwhile, recent work has identified emergent...
ATLAS: Verifier-Guided Adaptive Latent Activation Steering for Efficient LLM Reasoning
arXiv:2601.03093v2 Announce Type: replace Abstract: Recent work on activation and latent steering has demonstrated that modifying internal representations can effectively guide large language models (LLMs) toward improved reasoning and efficiency without updating model parameters. However, most existing approaches rely on fixed steering policies and static intervention strengths, which limit their robustness across problem instances and often result in over- or under-steering. We propose...
Expert-Aware Refusal Steering
Announce Type: new Abstract: Safety alignment in instruction-tuned large language models (LLMs) depends on a model's ability to reliably refuse to respond to harmful or disallowed requests. Recent work has shown that a steering vector can be applied to a dense LLM during inference to effectively suppress refusal behavior, inducing response to harmful requests. We extend this refusal steering method to three open-source Mixture-of-Experts (MoE) LLMs and find that steering performance is...
Task-Dependent Modulation of Feedback Control in Human Steering
We examined whether human steering behavior conforms to optimal feedback control (OFC) principles when driving a vehicle through sequences of upcoming gates varying in width (narrow/wide) relative to the vehicle's size, while occasional lateral velocity perturbations elicited corrective steering responses. In 24 participants, three predictions of OFC were tested: (1) greater positional variability when passing wide gates; (2) reduced corrective steering (lower feedback gains) to...
Steering LLMs? Actually, Sparse Autoencoders can outperform simple baselines
arXiv:2605.31183v1 Announce Type: new Abstract: Sparse Autoencoders (SAEs) have been seen as a promising avenue for exploring the internals of Large Language Models (LLMs) and for steering model output generation. When AxBench - a model steering benchmark - was introduced in Wu et al. (2025), SAEs did not seem to live up to their original hype due to poor steering performance relative to a set of simple baselines.
Steering Beyond the Support: Adversarial Training on Unsupervised Jailbroken Activation Simulation
arXiv:2605.24535v2 Announce Type: replace Abstract: Jailbreak prompts can trigger harmful completions on aligned LLMs, In accordance, safety steering has been proposed: test-time activation interventions that steer jailbreak activations to trigger refusal while preserving benign utility. However, existing steering methods are fundamentally supervised and tied to a static, limited training set, whereas real jailbreaks evolve and are often out-of-distributed from the training set, leading to...