Sycophancy Hides Linearly in the Attention Heads

Genadi, Rifo Ahmad
Nwadike, Munachiso Samuel
Mukhituly, Nurdaulet
Hiraoka, Tatsuya
AlQuabeh, Hilal
Inui, Kentaro
Department
Natural Language Processing
Type
Conference proceeding
License
http://creativecommons.org/licenses/by/4.0/
Abstract
We find that correct-to-incorrect sycophancy signals are most linearly accessible within multi-head attention activations. Motivated by the linear representation hypothesis, we train linear probes across the residual stream, multilayer perceptron (MLP), and attention layers to analyze where these signals emerge. Although separability appears in the residual stream and MLPs, steering using these probes is most effective in a sparse subset of middle-layer attention heads. Using TruthfulQA as the base dataset, we find that probes trained on it transfer effectively to other factual QA benchmarks. Furthermore, comparing our discovered direction to previously identified “truthful” directions reveals limited overlap, suggesting that factual accuracy and deference resistance arise from related but distinct mechanisms. Attention-pattern analysis further indicates that the influential heads attend disproportionately to expressions of user doubt, contributing to sycophantic shifts. Overall, these findings suggest that sycophancy can be mitigated through simple, targeted linear interventions that exploit the internal geometry of attention activations. Code will be released upon publication.
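As a rough illustration of the probe-then-steer idea the abstract describes, the Python sketch below fits a linear direction on synthetic "activations" and then shifts activations along it. Everything in it is an assumption made for illustration: the toy Gaussian data, the mean-difference probe (a simple stand-in, not necessarily the probe the authors train), the function names, and the steering strength `alpha`. It does not load a real model, attention heads, or TruthfulQA.

```python
import math
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fit_mean_difference_probe(sycophantic, faithful):
    """Hypothetical linear probe: unit vector from the faithful class mean
    to the sycophantic class mean, plus a midpoint decision threshold."""
    mu_s = mean_vector(sycophantic)
    mu_f = mean_vector(faithful)
    diff = [a - b for a, b in zip(mu_s, mu_f)]
    norm = math.sqrt(dot(diff, diff))
    direction = [x / norm for x in diff]
    threshold = 0.5 * (dot(mu_s, direction) + dot(mu_f, direction))
    return direction, threshold


def probe_predict(activation, direction, threshold):
    """True if the activation falls on the 'sycophantic' side of the probe."""
    return dot(activation, direction) > threshold


def steer(activation, direction, alpha):
    """Shift an activation by alpha units against the probe direction."""
    return [a - alpha * d for a, d in zip(activation, direction)]


def make_toy_activations(n, dim, shift, rng):
    """Synthetic stand-in for head activations: unit Gaussian noise with
    the class signal placed on the first coordinate (an assumption)."""
    rows = []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        row[0] += shift
        rows.append(row)
    return rows


rng = random.Random(0)
sycophantic = make_toy_activations(200, 8, 3.0, rng)
faithful = make_toy_activations(200, 8, -3.0, rng)

direction, threshold = fit_mean_difference_probe(sycophantic, faithful)

# Probe accuracy on the toy data it was fit on (no held-out split here).
correct = sum(probe_predict(x, direction, threshold) for x in sycophantic)
correct += sum(not probe_predict(x, direction, threshold) for x in faithful)
accuracy = correct / (len(sycophantic) + len(faithful))

# Steer the sycophantic activations and count how many the probe now
# places on the faithful side.
steered = [steer(x, direction, alpha=6.0) for x in sycophantic]
flipped = sum(not probe_predict(x, direction, threshold) for x in steered)
flipped_fraction = flipped / len(steered)

print(f"probe accuracy: {accuracy:.3f}")
print(f"fraction flipped by steering: {flipped_fraction:.3f}")
```

In a real setting the vectors would presumably be per-head attention outputs captured from a model, and the intervention would be applied during generation only to the heads where the probe is most effective; the sketch only shows the linear geometry involved.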
Citation
R.A. Genadi, M.S. Nwadike, N. Mukhituly, T. Hiraoka, H. AlQuabeh, K. Inui, "Sycophancy Hides Linearly in the Attention Heads," 2026, pp. 6896-6912.
Source
Conference
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Publisher
Association for Computational Linguistics