From Evasion to Concealment: Stealthy Knowledge Unlearning for LLMs

Gu, Tianle
Huang, Kexin
Luo, Ruilin
Yao, Yuanqi
Chen, Xiuying
Yang, Yujiu
Teng, Yan
Wang, Yingchun
Department
Natural Language Processing
Type
Conference proceeding
License
http://creativecommons.org/licenses/by/4.0/
Abstract
LLM Unlearning plays a crucial role in removing sensitive information from language models to mitigate potential misuse. However, previous approaches often treat nonsensical responses or template-based refusals (e.g., “Sorry, I cannot answer.”) as the unlearning target, which can give the impression of deliberate information suppression, making the process even more vulnerable to attacks and jailbreaks. Moreover, most methods rely on auxiliary models or retaining datasets, which adds complexity to the unlearning process. To address these challenges, we propose MEOW, a streamlined and stealthy unlearning method that eliminates the need for auxiliary models or retaining data while avoiding leakage through its innovative use of inverted facts. These inverted facts are generated by an offline LLM and serve as fine-tuning labels. Meanwhile, we introduce MEMO, a novel metric that measures the model’s memorization, to select optimal fine-tuning targets. The use of inverted facts not only maintains the covert nature of the model but also ensures that sensitive information is effectively forgotten without revealing the target data. Evaluated on the ToFU Knowledge Unlearning dataset using Llama2-7B-Chat and Phi-1.5, MEOW outperforms baselines in forgetting quality while preserving model utility. MEOW also maintains strong performance across NLU and NLG tasks and demonstrates superior resilience to attacks, validated via the Min-K% membership inference method.
Citation
T. Gu, K. Huang, R. Luo, Y. Yao, X. Chen, Y. Yang, Y. Teng, Y. Wang, "From Evasion to Concealment: Stealthy Knowledge Unlearning for LLMs," 2025, pp. 10261-10279.
Source
Proceedings of the Annual Meeting of the Association for Computational Linguistics
Conference
Findings of the Association for Computational Linguistics: ACL 2025
Publisher
Association for Computational Linguistics (ACL)