
AlignFix: Fixing Adversarial Perturbations by Agreement Checking for Adversarial Robustness against Black-box Attacks

Nirala, Ashutosh
Tian, Jin
Fakorede, Olukorede
Atsague, Modeste
Department
Machine Learning
Type
Journal article
Date
2025
Language
English
Abstract
Motivated by the vulnerability of feed-forward visual pathways to adversarial-like inputs and the overall robustness of biological perception, commonly attributed to top-down feedback processes, we propose a new defense method, AlignFix. We exploit the fact that naturally and adversarially trained models rely on distinct feature sets for classification. Notably, naturally trained models, referred to as weakM, retain commendable accuracy against adversarial examples generated using adversarially trained models, referred to as strongM, and vice versa. Further, the two models tend to agree more in their predictions when an input is nudged toward the correct class. Leveraging this, AlignFix first perturbs the input toward the class predicted by the naturally trained model, using a joint loss from both weakM and strongM. If this perturbation retains or produces agreement between the two models, their prediction is accepted; otherwise, the original strongM output is used. This mechanism is highly effective against leading score-based query attacks (SQA) as well as decision-based and transfer-based black-box attacks. We demonstrate its effectiveness through comprehensive experiments across multiple datasets (CIFAR and ImageNet) and architectures (ResNet and ViT).
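To make the mechanism described in the abstract concrete, the following is a minimal PyTorch sketch of AlignFix inference as described above. It assumes the full text's details are unavailable: the function name alignfix_predict, the step size, the number of steps, and the use of a signed-gradient update are illustrative assumptions, not the authors' reference implementation or hyperparameters.

```python
import torch
import torch.nn.functional as F

def alignfix_predict(x, weak_m, strong_m, step_size=2 / 255, num_steps=5):
    """Sketch of AlignFix inference (hypothetical names/hyperparameters).

    weak_m:   naturally trained model (weakM in the paper)
    strong_m: adversarially trained model (strongM in the paper)
    """
    weak_m.eval()
    strong_m.eval()

    # Class predicted by the naturally trained model; the input will be
    # nudged toward this class.
    with torch.no_grad():
        target = weak_m(x).argmax(dim=1)

    # Perturb the input toward `target` using a joint loss from both models.
    # A signed-gradient descent step is one plausible choice; the paper may
    # use a different update rule.
    x_fix = x.clone().detach().requires_grad_(True)
    for _ in range(num_steps):
        loss = (F.cross_entropy(weak_m(x_fix), target)
                + F.cross_entropy(strong_m(x_fix), target))
        grad, = torch.autograd.grad(loss, x_fix)
        x_fix = (x_fix - step_size * grad.sign()).clamp(0, 1)
        x_fix = x_fix.detach().requires_grad_(True)

    with torch.no_grad():
        pred_weak = weak_m(x_fix).argmax(dim=1)
        pred_strong = strong_m(x_fix).argmax(dim=1)
        fallback = strong_m(x).argmax(dim=1)  # original strongM output

    # Accept the nudged prediction only where the two models agree;
    # otherwise fall back to strongM's prediction on the original input.
    return torch.where(pred_weak == pred_strong, pred_strong, fallback)
```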
Citation
A. K. Nirala, J. Tian, O. Fakorede, and M. Atsague, “AlignFix: Fixing Adversarial Perturbations by Agreement Checking for Adversarial Robustness against Black-box Attacks,” Trans. Mach. Learn. Res., vol. 2025-August, 2025.
Source
Transactions on Machine Learning Research
Publisher
Transactions on Machine Learning Research