GNN-ViTCap: GNN-Enhanced Multiple Instance Learning with Vision Transformers for Whole Slide Image Classification and Captioning

Raju, S. M. Taslim Uddin
Islam, Md Milon
Haque, Md Rezwanul
Altaheri, Hamdi
Karray, Fakhry O.
Department
Machine Learning
Type
Conference proceeding
Date
2025
Language
English
Abstract
Microscopic assessment of histopathology images is vital for accurate cancer diagnosis and treatment. Whole Slide Image (WSI) classification and captioning have become crucial tasks in computer-aided pathology. However, microscopic WSIs face challenges such as redundant patches and unknown patch positions, since patches are captured at the pathologist's discretion. Moreover, generating automatic pathology captions remains a significant challenge. To address these challenges, a novel GNN-ViTCap framework is introduced for classification and caption generation from histopathological microscopic images. A visual feature extractor produces patch embeddings, after which redundant patches are removed by dynamically clustering the images with deep embedded clustering and selecting representative images through a scalar dot-product attention mechanism. A graph is then formed by constructing edges from the similarity matrix, connecting each node to its nearest neighbors, and a graph neural network extracts contextual information from both local and global regions. The aggregated image embeddings are projected into the language model's input space by a linear layer and combined with input caption tokens to fine-tune large language models for caption generation. Our proposed method is validated on the BreakHis and PatchGastric microscopic datasets. GNN-ViTCap achieves an F1-score of 0.934 and an AUC of 0.963 for classification, along with BLEU@4 = 0.811 and METEOR = 0.569 for captioning. Experimental analysis demonstrates that the GNN-ViTCap architecture outperforms state-of-the-art (SOTA) approaches, providing a reliable and efficient approach for patient diagnosis from microscopy images.
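The graph-construction step described in the abstract (edges derived from a similarity matrix, each node connected to its nearest neighbors, followed by GNN-based aggregation) can be sketched roughly as below. This is a minimal illustrative sketch, not the paper's implementation: the function names, the choice of cosine similarity, the value of k, and the simple mean-aggregation message-passing layer are all assumptions.

```python
import numpy as np

def build_knn_graph(embeddings, k=3):
    """Connect each patch embedding to its k most similar neighbours.

    A cosine-similarity matrix is computed over all patch embeddings,
    then each node gets directed edges to its top-k neighbours
    (assumed construction; the paper only states 'nearest neighbors').
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T                 # pairwise similarity matrix
    np.fill_diagonal(sim, -np.inf)          # exclude self-similarity
    adj = np.zeros_like(sim)
    for i in range(sim.shape[0]):
        nbrs = np.argsort(sim[i])[-k:]      # indices of k nearest neighbours
        adj[i, nbrs] = 1.0
    return adj

def gnn_mean_aggregate(embeddings, adj):
    """One message-passing step: each node averages its own embedding
    with those of its neighbours, mixing in local context
    (a stand-in for the GNN layer used in the paper)."""
    adj_sl = adj + np.eye(adj.shape[0])     # add self-loops
    deg = adj_sl.sum(axis=1, keepdims=True) # per-node degree for averaging
    return (adj_sl @ embeddings) / deg
```

In a full pipeline, the aggregated embeddings would then pass through a linear projection into the language model's input space, as the abstract describes.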
Citation
S. M. Taslim Uddin Raju, M. Milon Islam, M. R. Haque, H. Altaheri and F. Karray, "GNN-ViTCap: GNN-Enhanced Multiple Instance Learning with Vision Transformers for Whole Slide Image Classification and Captioning," 2025 International Joint Conference on Neural Networks (IJCNN), Rome, Italy, 2025, pp. 1-9, doi: 10.1109/IJCNN64981.2025.11228324.
Source
Proceedings of the International Joint Conference on Neural Networks (IJCNN)
Conference
2025 International Joint Conference on Neural Networks, IJCNN 2025
Keywords
Deep Embedded Clustering, Graph-Based Aggregation, Image Captioning, Large Language Models, Microscopic WSI, Vision Transformer
Publisher
IEEE