VisualRAG: Knowledge-Guided Retrieval Augmentation for Image-Text Matching
Wang, Hengchang; Liu, Li; Zhang, Huaxiang; Zhu, Lei; Chang, Xiaojun; Du, Hao
Department
Computer Vision
Type
Journal article
Date
2025
Language
English
Abstract
Image-text matching, a fundamental cross-modal understanding task, presents unique challenges in weakly aligned scenarios. Such data typically feature highly abstract textual captions with sparse entity references, creating a significant semantic gap between text and visual content. Current mainstream methods, designed primarily for strongly aligned data pairs, employ dynamic modeling or multi-dimensional similarity computation to map features into a shared space; however, they struggle with the information asymmetry and modal heterogeneity of weakly aligned cases. To address this, we propose a Visual Perception Knowledge Enhancement (VPKE) framework. Unlike existing methods built on strong alignment assumptions, the framework mines latent image semantics through vision-language models and generates auxiliary captions, overcoming the information bottleneck of the traditional text modality. Its core innovation is an adaptive knowledge distillation mechanism that combines retrieval-augmented generation (RAG) with key entity extraction, effectively filtering noise when introducing external knowledge while optimizing cross-modal feature integration. The framework employs multi-level similarity evaluation to dynamically adjust the fusion weights among the original text, key entities, and auxiliary captions, enabling adaptive integration of diverse semantic features and significantly improving model flexibility; a rough sketch of this weighting idea is given below. Multi-scale feature extraction further enhances cross-modal representation. Experiments on the MSCOCO and Flickr30K datasets show that the proposed method achieves excellent image-text retrieval performance, validating its effectiveness.
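The abstract does not specify how the similarity-guided fusion weights are computed. As a rough illustration only, the following minimal PyTorch sketch weights each textual view (original caption, extracted key entities, VLM-generated auxiliary caption) by its cosine similarity to the image embedding; the function name, tensor shapes, and softmax temperature are all assumptions for the sketch, not the authors' implementation.

import torch
import torch.nn.functional as F

def fuse_text_features(orig_feat, entity_feat, caption_feat, img_feat,
                       temperature=0.1):
    # Hypothetical sketch: all inputs are (batch, dim) embeddings.
    # Stack the three textual views and L2-normalize everything so that
    # dot products become cosine similarities.
    views = torch.stack([orig_feat, entity_feat, caption_feat], dim=1)  # (B, 3, D)
    views = F.normalize(views, dim=-1)
    img = F.normalize(img_feat, dim=-1).unsqueeze(1)                    # (B, 1, D)
    # Cosine similarity of each textual view to the image.
    sims = (views * img).sum(dim=-1)                                    # (B, 3)
    # Softmax over the three views yields adaptive fusion weights.
    weights = F.softmax(sims / temperature, dim=1)                      # (B, 3)
    # Weighted combination of the views gives the fused text feature.
    fused = (weights.unsqueeze(-1) * views).sum(dim=1)                  # (B, D)
    return fused, weights

# Usage: weights for each sample sum to 1 across the three views.
B, D = 4, 512
fused, w = fuse_text_features(torch.randn(B, D), torch.randn(B, D),
                              torch.randn(B, D), torch.randn(B, D))
print(w.sum(dim=1))

A lower temperature concentrates weight on the single view most similar to the image, while a higher one blends the views more evenly; the paper's actual mechanism may differ.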
Citation
H. Wang, L. Liu, H. Zhang, L. Zhu, X. Chang and H. Du, "VisualRAG: Knowledge-Guided Retrieval Augmentation for Image-Text Matching," in IEEE Transactions on Circuits and Systems for Video Technology, doi: 10.1109/TCSVT.2025.3597097
Source
IEEE Transactions on Circuits and Systems for Video Technology
Keywords
Image-Text Matching, Knowledge Enhancement, Large Language Model, Modality Heterogeneity
Publisher
IEEE
