Against The Achilles' Heel: A Survey on Red Teaming for Generative Models
Lin, Lizhi; Mu, Honglin; Zhai, Zenan; Wang, Minghan; Wang, Yuxia; Wang, Renxi; Gao, Junjie; Zhang, Yixuan; Che, Wanxiang; Baldwin, Timothy; and 2 more
Department
Natural Language Processing
Type
Journal article
Date
2025
Language
English
Abstract
Generative models are rapidly gaining popularity and being integrated into everyday applications, raising concerns over their safe use as various vulnerabilities are exposed. In light of this, the field of red teaming is undergoing fast-paced growth, highlighting the need for a comprehensive survey covering the entire pipeline and addressing emerging topics. Our extensive survey, which examines over 120 papers, introduces a taxonomy of fine-grained attack strategies grounded in the inherent capabilities of language models. Additionally, we have developed the “searcher” framework to unify various automatic red teaming approaches. Moreover, our survey covers novel areas including multimodal attacks and defenses, risks around LLM-based agents, overkill of harmless queries, and the balance between harmlessness and helpfulness.
Citation
L. Lin, H. Mu, Z. Zhai, M. Wang, Y. Wang, R. Wang et al., "Against the Achilles' heel: a survey on red teaming for generative models", Journal of Artificial Intelligence Research, vol. 82, pp. 687-775, 2025. https://doi.org/10.1613/jair.1.17654
Source
Journal of Artificial Intelligence Research
Publisher
AI ACCESS FOUNDATION
