
Aleatoric-Epistemic Joint Uncertainty Modeling for Cross-Modal Retrieval

Chang, Tianyu
Song, Peipei
Yang, Xun
Guo, Dan
Chang, Xiaojun
Department
Computer Vision
Type
Journal article
Language
English
Abstract
Cross-modal retrieval has gained significant attention with the advent of large-scale vision-language pretraining models such as CLIP. These methods typically map the vision and language modalities into a shared embedding space and then compute similarities from the joint feature representations. Despite tremendous progress in this field, most existing methods still produce unreliable retrieval results because of data and model uncertainties, which can arise from inherent data ambiguity or noisy pairs. In this article, we propose a novel cross-modal retrieval framework with aleatoric-epistemic joint uncertainty modeling (AEUM). AEUM aims to provide reliable uncertainty estimates for both the data (aleatoric uncertainty, AU) and the model (epistemic uncertainty, EU), which are then used to correct the initial cross-modal similarity and yield more accurate retrieval results. Specifically, for AU, we introduce learnable semantic tokens for each modality to estimate the data-induced uncertainty in the other modality, offering guidance on data complexity or ambiguity. For EU, we leverage the efficient evidential learning paradigm to estimate model-induced uncertainty and incorporate it into the model's predictions, thereby enhancing robustness against noisy data. Extensive experiments demonstrate the effectiveness and generalizability of our method on multiple cross-modal retrieval benchmarks, including five video-text retrieval datasets (MSRVTT, LSMDC, MSVD, VATEX, and DiDeMo) and two image-text retrieval datasets (MSCOCO and Flickr30K). Our code is publicly available at https://github.com/cty8998/AEUM.
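The abstract does not give formulas, so the following is only a minimal sketch of how an evidential (subjective-logic) uncertainty estimate is commonly computed and how it might be used to adjust a similarity score. It is not the authors' implementation: the softplus evidence function, the function names `evidential_uncertainty` and `correct_similarity`, the `weight` parameter, and the multiplicative correction rule are all assumptions made for illustration.

```python
import math


def softplus(z):
    # Numerically stable softplus; used here (as an assumption) to turn
    # raw scores into non-negative evidence.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def evidential_uncertainty(logits):
    """Return (beliefs, uncertainty) for one query over K candidates.

    Standard subjective-logic construction: evidence e_k >= 0,
    Dirichlet parameters alpha_k = e_k + 1, strength S = sum(alpha),
    belief b_k = e_k / S, uncertainty u = K / S, so sum(b) + u = 1.
    """
    evidence = [softplus(z) for z in logits]
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    beliefs = [e / strength for e in evidence]
    uncertainty = len(logits) / strength
    return beliefs, uncertainty


def correct_similarity(sim, uncertainty, weight=1.0):
    # Hypothetical correction rule: shrink a similarity score in
    # proportion to the estimated uncertainty. The paper's actual
    # correction may differ.
    return sim * (1.0 - weight * uncertainty)
```

With little evidence for any candidate the uncertainty approaches 1 and the corrected similarity shrinks toward 0, while strong evidence leaves the score nearly unchanged.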
Citation
T. Chang, P. Song, X. Yang, D. Guo, X. Chang, "Aleatoric-Epistemic Joint Uncertainty Modeling for Cross-Modal Retrieval," IEEE Transactions on Cybernetics, vol. PP, no. 99, pp. 1-14, 2026, https://doi.org/10.1109/tcyb.2026.3664380.
Source
IEEE Transactions on Cybernetics
Keywords
46 Information and Computing Sciences, 4603 Computer Vision and Multimedia Computation
Publisher
IEEE