
PedCLIP: A Vision-Language Model for Pediatric X-Rays with Mixture of Body Part Experts

Huy, Ta Duc
Shoby, Abin
Tran, Sen
Xie, Yutong
Chen, Qi
Nguyen, Phi Le
Gole, Akshay
Liu, Lingqiao
Perperidis, Antonios
Friswell, Mark
(7 additional authors not shown)
Abstract
Vision-language models have demonstrated remarkable success in general medical image analysis, yet their application to pediatric imaging remains significantly underexplored. These models show limited performance on pediatric datasets, primarily due to domain gaps stemming from anatomical differences, lower radiation doses, and pediatric-specific diseases. To address this gap, we present the first pediatric vision-language pre-training framework, dubbed PedCLIP, trained on a comprehensive pediatric imaging dataset comprising 404,670 X-rays of pediatric patients across diverse anatomical regions. To handle this anatomical diversity, we introduce a Mixture of Body part Experts design, with each expert specializing in learning features from a distinct anatomical region. Experimental evaluation across eleven downstream tasks demonstrates that our model significantly outperforms current state-of-the-art vision-language models, achieving superior diagnostic accuracy in challenging pediatric conditions, including rare diseases such as pediatric inflammatory arthritis. Code is available at https://github.com/tadeephuy/PedCLIP
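The core idea described in the abstract — routing image features through body-part-specific experts via a learned gate — can be sketched as a minimal mixture-of-experts layer. This is an illustrative assumption of how such a design might look (names, shapes, and the number of experts are hypothetical, not taken from the authors' implementation):

```python
import numpy as np

# Hypothetical sketch of a Mixture of Body part Experts (MoBE) layer:
# a gating network scores each body-part expert, and the output is the
# gate-weighted sum of expert outputs. All weights here are random
# placeholders; a real model would learn them during pre-training.

rng = np.random.default_rng(0)

D, N_EXPERTS = 64, 5  # feature dim; e.g. chest/abdomen/skull/limb/spine experts

# One linear "expert" per anatomical region.
experts = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(N_EXPERTS)]
gate_w = rng.standard_normal((D, N_EXPERTS)) / np.sqrt(D)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mobe_forward(x):
    """Route a batch of image features (B, D) through body-part experts."""
    gates = softmax(x @ gate_w)                   # (B, N_EXPERTS), rows sum to 1
    outs = np.stack([x @ W for W in experts], 1)  # (B, N_EXPERTS, D)
    return (gates[..., None] * outs).sum(axis=1), gates

x = rng.standard_normal((2, D))
y, gates = mobe_forward(x)
print(y.shape)                                # (2, 64)
print(np.allclose(gates.sum(axis=1), 1.0))    # True
```

In a CLIP-style setup, the mixed feature `y` would then be projected into a shared embedding space and trained contrastively against text features from the paired radiology report.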
Citation
T. D. Huy et al., “PedCLIP: A Vision-Language Model for Pediatric X-Rays with Mixture of Body Part Experts,” pp. 487–497, 2026, doi: 10.1007/978-3-032-04971-1_46
Source
Proceedings of Medical Image Computing and Computer Assisted Intervention (MICCAI)
Conference
28th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI)
Keywords
Pediatric, Mixture-of-experts, Vision-language model
Publisher
Springer Nature