
Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models

Gao, Lang
Geng, Jiahui
Zhang, Xiangliang
Nakov, Preslav Ivanov
Chen, Xiuying
Department
Natural Language Processing
Type
Conference proceeding
Date
2025
Language
English
Abstract
Jailbreaking in large language models (LLMs) poses major security risks by tricking models into generating harmful text. However, there is limited understanding of how jailbreaking operates, which makes it difficult to develop effective defenses. In this work, we conduct a large-scale analysis of seven jailbreak methods and find that inconsistencies in previous studies arise from insufficient observation samples. Our analysis reveals that jailbreaks shift harmful activations beyond a defined safety boundary, where LLMs become less sensitive to harmful information. We also find that the low and middle layers are critical in driving these shifts, while deeper layers play a lesser role. Leveraging these insights, we propose a novel defense mechanism called Activation Boundary Defense (ABD), which adaptively constrains activations within the safety boundary. To further optimize performance, we use Bayesian optimization to select the most effective layers for applying ABD, and confirm that the low and middle layers have the greatest impact, consistent with our earlier observations. Experiments across multiple benchmarks demonstrate that ABD achieves an average defense success rate (DSR) of over 98% against various jailbreak attacks, with less than 2% impact on the model's overall capabilities.
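The abstract describes ABD as adaptively constraining a layer's activations within a safety boundary. As a purely illustrative sketch of that idea, not the authors' implementation: the boundary shape (a sphere), the per-layer `center` and `radius` statistics, and the projection rule below are all assumptions introduced for illustration.

```python
import numpy as np

def clamp_to_safety_boundary(activation, center, radius):
    """Project an activation back inside a spherical 'safety boundary'.

    `center` and `radius` are hypothetical per-layer statistics (e.g. a mean
    activation and a norm threshold estimated from benign prompts); the
    paper's actual boundary definition may differ.
    """
    offset = activation - center
    dist = np.linalg.norm(offset)
    if dist > radius:
        # Rescale the offset so the activation lies exactly on the boundary.
        activation = center + offset * (radius / dist)
    return activation

# Toy usage: a 4-dim activation far outside a unit ball around the origin
# is pulled back onto the boundary; one already inside is left unchanged.
center = np.zeros(4)
clamped = clamp_to_safety_boundary(np.array([3.0, 0.0, 0.0, 0.0]), center, 1.0)
inside = clamp_to_safety_boundary(np.array([0.1, 0.0, 0.0, 0.0]), center, 1.0)
```

In this toy version, applying the clamp only at selected low and middle layers would mirror the paper's observation that those layers drive the harmful activation shifts.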
Citation
L. Gao, J. Geng, X. Zhang, P. Nakov, and X. Chen, “Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models,” pp. 25378–25398, Aug. 2025, doi: 10.18653/V1/2025.ACL-LONG.1233.
Source
Proceedings of the Annual Meeting of the Association for Computational Linguistics
Conference
63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025
Publisher
Association for Computational Linguistics