Multi-modality guided cross-attention for visual question answering
Khan, Muhammad Zeeshan ; Nguyen, Duc Thanh ; Nguyen, Thanh Thi ; Gaddam, Anuroop ; Razzak, Imran
Files
s11042-025-21049-w.pdf
Adobe PDF, 1.87 MB
Department
Computational Biology
Type
Journal article
Language
English
Abstract
Visual Question Answering (VQA) is a multimodal research domain at the intersection of computer vision and natural language processing, concerned with processing and understanding combined visual-textual data. Traditional VQA methods extract visual and textual features from separate pre-trained architectures and then combine the features from both modalities in a common feature space. These methods perform well on high-level perception questions, but attaining high accuracy on low-level perception questions remains challenging. The difficulties include detecting relevant visual and textual information, building meaningful associations between them, and extracting insights from the multimodal data. To address these challenges, unlike existing approaches, we propose a novel multi-modality guided cross self-attention mechanism that builds semantic relationships both within individual modalities and between them. Specifically, we examine visual-guided cross-attention (VGCA), textual-guided cross-attention (TGCA), and multi-modality-guided cross-attention (MMGCA). We utilise convolutional neural networks (CNNs) for visual feature learning, and LSTM and FNet for textual feature learning. We evaluate our method on two benchmark datasets, VQA 1.0 and VQA 2.0. Experimental results show that our method outperforms existing baselines across various question types on both datasets.
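The guided cross-attention the abstract describes can be illustrated with a minimal sketch: queries from one modality attend over keys and values from the other, so visual-guided attention (VGCA) lets image regions attend to question tokens, and textual-guided attention (TGCA) does the reverse. The code below is an illustrative approximation in plain numpy, not the paper's actual architecture; the feature shapes, the `cross_attention` function, and the mean-pool fusion step are all assumptions for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, context_feats):
    """Scaled dot-product cross-attention: rows of `query_feats` (one
    modality) attend over rows of `context_feats` (the other modality)."""
    d = query_feats.shape[-1]
    scores = query_feats @ context_feats.T / np.sqrt(d)   # (Nq, Nc)
    weights = softmax(scores, axis=-1)                    # attention map
    return weights @ context_feats                        # (Nq, d)

rng = np.random.default_rng(0)
visual = rng.normal(size=(49, 64))    # e.g. a 7x7 CNN feature grid, flattened
textual = rng.normal(size=(12, 64))   # e.g. 12 question-token embeddings

vgca = cross_attention(visual, textual)   # visual-guided: vision queries text
tgca = cross_attention(textual, visual)   # textual-guided: text queries vision

# a simple stand-in for multi-modality fusion: pool and concatenate
fused = np.concatenate([vgca.mean(axis=0), tgca.mean(axis=0)])
print(fused.shape)  # (128,)
```

In practice such a block would use learned query/key/value projections and multiple heads; the sketch keeps only the attention pattern itself to show how each guiding direction produces a differently shaped attended representation before fusion.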
Citation
M.Z. Khan, D.T. Nguyen, T.T. Nguyen, A. Gaddam, I. Razzak, "Multi-modality guided cross-attention for visual question answering," Multimedia Tools and Applications, vol. 84, no. 39, pp. 47543-47565, 2025, https://doi.org/10.1007/s11042-025-21049-w.
Source
Multimedia Tools and Applications
Keywords
46 Information and Computing Sciences, 4611 Machine Learning
Publisher
Springer Nature
