Steer

Related Articles from SNS

Riemannian-Manifold Steering: Geometry-Aware Generative Autoencoders for Label-Free Steering

arXiv:2605.24942v2. Steering a language model (intervening on its internal activations to change downstream behaviour) has recently expanded beyond linear interpolation to nonlinear methods such as angular and kernelized steering, which define intervention transformations without learning an explicit geometry over paths in activation space. Freshly introduced geometry-aware manifold methods do learn such a geometry, but require labelled class centroids...

arXiv CS 1d ago

Evi-Steer: Learning to Steer Biomedical Vision-Language Models through Efficient and Generalizable Evidential Tuning

Parameter-efficient adaptation of vision-language foundation models is crucial for precise multimodal understanding of biomedical images, yet existing methods remain deterministic and often struggle under domain shift or ambiguous image-text alignment. This limitation is particularly critical in the clinic, where models should remain robust in low-data regimes and domain shifts.

arXiv CS 8d ago

Pre-Intervention Prediction of Sparse Autoencoder Steering Side Effects

Sparse autoencoder (SAE) features are increasingly used to steer language models, but feature steering is rarely clean: the same intervention can behave inconsistently across contexts and perturb unrelated features. We introduce a pre-intervention screening framework for forecasting SAE steering side effects from feature statistics computed before steering. We operationalize side effects along two axes of steering modularity, effect stability and collateral...

arXiv CS 1d ago

Endogenous Resistance to Activation Steering in Language Models

arXiv:2602.06941v2. Large language models can recover mid-generation from task-misaligned activation steering, producing explicit verbal restarts (e.g., "wait, that's not right") and continuing on-topic even while the steering perturbation remains active. We term this Endogenous Steering Resistance (ESR). Using sparse autoencoder (SAE) latents to steer model activations, we find that Llama-3.3-70B exhibits explicit ESR at \llamaseventyEsrRate\%, with smaller...

arXiv CS 2d ago

Activation Steering Induces Emergent Misalignment: A More Comprehensive Evaluation

arXiv:2606.08682v1. Activation steering has emerged as a popular inference-time technique for modulating the behavior of large language models (LLMs). By constructing a steering vector from examples of a target behavior and injecting it into intermediate activations during inference, activation steering enables flexible behavioral control while avoiding the permanent parameter updates required by finetuning. Meanwhile, recent work has identified emergent...

arXiv CS 1d ago
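
The abstract above describes building a steering vector from examples of a target behavior and injecting it into intermediate activations. A minimal sketch of that idea follows, assuming a difference-of-means construction and simple additive injection; the listing does not confirm that the paper uses either, and the function names (`mean_vector`, `steering_vector`, `steer`), the `strength` parameter, and the toy two-dimensional activations are all hypothetical.

```python
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]


def steering_vector(target_acts, baseline_acts):
    """Assumed construction: mean target activation minus mean baseline activation."""
    mu_target = mean_vector(target_acts)
    mu_baseline = mean_vector(baseline_acts)
    return [t - b for t, b in zip(mu_target, mu_baseline)]


def steer(hidden, vector, strength=1.0):
    """Add the scaled steering vector to a single hidden-state vector."""
    return [h + strength * v for h, v in zip(hidden, vector)]


# Toy activations standing in for one layer's hidden states (illustrative only).
target = [[1.0, 2.0], [3.0, 4.0]]      # examples showing the target behavior
baseline = [[0.0, 0.0], [1.0, 1.0]]    # examples without it

v = steering_vector(target, baseline)
steered = steer([0.5, 0.5], v, strength=2.0)
```

In a real model the same addition would typically be applied to one layer's activations at inference time, which leaves the weights unchanged, in line with the abstract's contrast with finetuning.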

ATLAS: Verifier-Guided Adaptive Latent Activation Steering for Efficient LLM Reasoning

arXiv:2601.03093v2. Recent work on activation and latent steering has demonstrated that modifying internal representations can effectively guide large language models (LLMs) toward improved reasoning and efficiency without updating model parameters. However, most existing approaches rely on fixed steering policies and static intervention strengths, which limit their robustness across problem instances and often result in over- or under-steering. We propose...

arXiv CS 1d ago

Expert-Aware Refusal Steering

Safety alignment in instruction-tuned large language models (LLMs) depends on a model's ability to reliably refuse to respond to harmful or disallowed requests. Recent work has shown that a steering vector can be applied to a dense LLM during inference to effectively suppress refusal behavior, inducing response to harmful requests. We extend this refusal steering method to three open-source Mixture-of-Experts (MoE) LLMs and find that steering performance is...

arXiv CS 6d ago

Task-Dependent Modulation of Feedback Control in Human Steering

We examined whether human steering behavior conforms to optimal feedback control (OFC) principles when driving a vehicle through sequences of upcoming gates varying in width (narrow/wide) relative to the vehicle's size, while occasional lateral velocity perturbations elicited corrective steering responses. In 24 participants, three predictions of OFC were tested: (1) greater positional variability when passing wide gates; (2) reduced corrective steering (lower feedback gains) to...

bioRxiv 12d ago

Steering LLMs? Actually, Sparse Autoencoders can outperform simple baselines

arXiv:2605.31183v1. Sparse Autoencoders (SAEs) have been seen as a promising avenue for exploring the internals of Large Language Models (LLMs) and for steering model output generation. When AxBench, a model steering benchmark, was introduced in Wu et al. (2025), SAEs did not seem to live up to their original hype due to poor steering performance relative to a set of simple baselines.

arXiv CS 9d ago

Steering Beyond the Support: Adversarial Training on Unsupervised Jailbroken Activation Simulation

arXiv:2605.24535v2. Jailbreak prompts can trigger harmful completions on aligned LLMs. Accordingly, safety steering has been proposed: test-time activation interventions that steer jailbreak activations to trigger refusal while preserving benign utility. However, existing steering methods are fundamentally supervised and tied to a static, limited training set, whereas real jailbreaks evolve and are often out-of-distribution relative to the training set, leading to...

arXiv CS 9d ago