LLaVA-Endo: a large language-and-vision assistant for gastrointestinal endoscopy
Yao, Jieru ; Li, Xueran ; Xie, Qiang ; Han, Longfei ; Jia, Yiwen ; Liu, Nian ; Zhang, Dingwen ; Han, Junwei
Department
Computer Vision
Type
Letter
Date
2025
Language
English
Abstract
We introduce LLaVA-Endo, a large language-and-vision model designed for the field of GI endoscopy. Specifically, we construct a high-quality dataset for GI endoscopic medical language-image instruction tuning and introduce an innovative progressive transfer learning technique to fine-tune LLaVA. Experimental results show that LLaVA-Endo demonstrates strong domain expertise and conversational capabilities, outperforming previous state-of-the-art multimodal methods on GI endoscopy data. In the future, we intend to collect more data for training and evaluation and to integrate additional functionalities such as report generation and polyp segmentation. © Higher Education Press 2025.
Citation
J. Yao et al., “LLaVA-Endo: a large language-and-vision assistant for gastrointestinal endoscopy,” Front Comput Sci, vol. 19, no. 4, p. 194331, Apr. 2025, doi: 10.1007/s11704-024-40319-8.
Source
Frontiers of Computer Science
Publisher
Springer Nature
