Generative Augmentation Hashing for Few-shot Cross-Modal Retrieval
Li, Fengling ; Wang, Zequn ; Wang, Tianshi ; Zhu, Lei ; Chang, Xiaojun
Department
Computer Vision
Type
Journal article
Date
2025
Language
English
Abstract
Deep cross-modal hashing has demonstrated strong performance in large-scale retrieval but remains challenging in few-shot scenarios due to limited data and weak cross-modal alignment. We propose Generative Augmentation Hashing (GAH), a new framework that synergizes Vision-Language Models (VLMs) and generation-driven hashing to address these limitations. GAH first introduces a cycle generative augmentation mechanism: VLMs generate descriptive textual captions for images, which, combined with label semantics, guide diffusion models to synthesize semantically aligned images via inconsistency filtering. These images then regenerate coherent textual descriptions through VLMs, forming a self-reinforcing cycle that iteratively expands cross-modal data. To resolve the diversity-alignment trade-off in augmentation, we design cross-modal perturbation enhancement, injecting synchronized perturbations with controlled noise to preserve inter-modal semantic relationships while enhancing robustness. Finally, GAH employs dual-level adversarial hash learning, where adversarial alignment of modality-specific and shared latent spaces optimizes both cross-modal consistency and discriminative hash code generation, effectively bridging heterogeneous gaps. Extensive experiments on benchmark datasets show that GAH outperforms state-of-the-art methods in few-shot cross-modal retrieval, achieving significant improvements in retrieval accuracy. Our source codes and datasets are available at https://github.com/xiaolaohuuu/GAH.
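The "synchronized perturbations with controlled noise" idea from the abstract can be illustrated with a minimal sketch: applying the *same* noise vector to a paired image and text embedding diversifies the pair while leaving their relative (inter-modal) geometry unchanged. This is only an assumption-laden illustration — the function name `synchronized_perturbation`, the Gaussian noise model, and the `sigma` parameter are hypothetical; the paper's actual perturbation scheme is not detailed in this abstract.

```python
import numpy as np

def synchronized_perturbation(img_emb, txt_emb, sigma=0.05, rng=None):
    """Inject a shared Gaussian perturbation into paired embeddings.

    Hypothetical sketch: using the SAME noise for both modalities
    means (img + n) - (txt + n) == img - txt, so the inter-modal
    relationship is preserved exactly while each embedding moves.
    """
    rng = rng or np.random.default_rng()
    noise = rng.normal(0.0, sigma, size=img_emb.shape)  # one shared draw
    return img_emb + noise, txt_emb + noise

# Usage: the pairwise difference is unchanged after perturbation.
img = np.ones((4, 8))
txt = np.zeros((4, 8))
p_img, p_txt = synchronized_perturbation(img, txt, sigma=0.1,
                                         rng=np.random.default_rng(0))
assert np.allclose(p_img - p_txt, img - txt)
```

In a real system the noise scale would presumably be tuned (the "controlled" aspect) and applied in the shared latent space before hash learning.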
Citation
F. Li, Z. Wang, T. Wang, L. Zhu and X. Chang, "Generative Augmentation Hashing for Few-shot Cross-Modal Retrieval," in IEEE Transactions on Circuits and Systems for Video Technology, doi: 10.1109/TCSVT.2025.3588769
Source
IEEE Transactions on Circuits and Systems for Video Technology
Keywords
Cross-modal retrieval, few-shot learning, generative augmentation, perturbation enhancement
Publisher
IEEE
