TOP-Training: Target-Oriented Pretraining for Medical Extractive Question Answering
Sengupta, Saptarshi ; Heaton, Connor ; Ghosh, Shreya ; Yin, Wenpeng ; Nakov, Preslav ; Wang, Suhang
Department
Natural Language Processing
Type
Conference proceeding
Date
2025
Language
English
Abstract
We study extractive question answering in the medical domain (Medical-EQA). This problem poses two main challenges: (i) domain specificity, as most AI models lack the necessary domain knowledge, and (ii) an extraction-based answering style, which limits most autoregressive LLMs due to their potential for hallucination. To address these challenges, we propose TOP-Training, a target-oriented pretraining paradigm that stands out among domain-adaptation techniques with two desirable features: (i) it goes one step further than popular domain-oriented fine-tuning, since it not only moves closer to the target domain but also familiarizes itself with the target dataset, and (ii) it does not assume the existence of a large set of unlabeled instances from the target domain. Specifically, for a target Medical-EQA dataset, we extract its entities and leverage large language models (LLMs) to generate synthetic texts containing those entities; we then demonstrate that pretraining on this synthetic data yields better performance on the target Medical-EQA benchmarks. Overall, our contributions are threefold: (i) TOP-Training, a new pretraining technique that effectively adapts LLMs to a target problem; (ii) broad applicability, since TOP-Training does not require the target problem to come with a large set of unlabeled data; and (iii) experiments that highlight the limitations of autoregressive LLMs and position TOP-Training as a means to unlock the true potential of bidirectional LLMs.
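The data-generation loop sketched in the abstract (extract entities from the target EQA dataset, then have an LLM produce synthetic pretraining text containing them) could be outlined as follows. This is a minimal illustrative sketch, not the paper's implementation: the entity extractor here just reuses gold answer spans (a real pipeline would use a medical NER model), and the `llm` callable is a hypothetical placeholder for any text-generation backend.

```python
def extract_entities(dataset):
    """Collect candidate entities from a target EQA dataset.
    Toy stand-in: treat each gold answer string as an entity."""
    entities = set()
    for example in dataset:
        entities.add(example["answer"])
    return sorted(entities)

def generate_synthetic_text(entity, llm=None):
    """Ask an LLM for a passage mentioning `entity`.
    `llm` is a hypothetical callable (prompt -> text); without one,
    fall back to a trivial template so the sketch stays runnable."""
    prompt = f"Write a short medical paragraph about {entity}."
    if llm is not None:
        return llm(prompt)
    return f"{entity} is a clinical concept discussed in the medical literature."

def build_pretraining_corpus(dataset, llm=None):
    """End-to-end: one synthetic passage per unique entity.
    The resulting corpus would then be used for continued pretraining."""
    return [generate_synthetic_text(e, llm) for e in extract_entities(dataset)]

# Tiny hypothetical target dataset with two QA pairs.
target = [
    {"question": "What drug treats hypertension?", "answer": "lisinopril"},
    {"question": "What organ does hepatitis affect?", "answer": "liver"},
]
corpus = build_pretraining_corpus(target)
print(len(corpus))  # one passage per unique entity
```

The key design point the abstract emphasizes is that the synthetic corpus is anchored to the *target dataset's* entities, so pretraining adapts the model to the specific benchmark rather than to the broad domain, and no large unlabeled in-domain corpus is needed.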
Citation
S. Sengupta, C. Heaton, S. Ghosh, W. Yin, P. Nakov, and S. Wang, “TOP-Training: Target-Oriented Pretraining for Medical Extractive Question Answering,” 2025. Accessed: Apr. 03, 2025. [Online]. Available: https://aclanthology.org/2025.coling-main.469/
Source
Proceedings - International Conference on Computational Linguistics, COLING
Conference
Keywords
Target-Oriented Pretraining (TOP-Training), Medical Extractive Question Answering (Medical-EQA), Large Language Models (LLMs), Synthetic data generation, Domain adaptation
Publisher
Association for Computational Linguistics
