Human–large language model collaboration in clinical medicine: a systematic review and meta-analysis
Wang, Guoyong ; Zhang, Kaijun ; Jiang, Jiyue ; Wang, Chaonan ; Bi, Hui ; Liang, Haojun ; Qi, Zuoliang ; Huang, Ying ; Li, Yu ; Yang, Xiaonan
Files
s41746-026-02382-2.pdf
Adobe PDF, 1.59 MB
Department
Computational Biology
Type
Journal article
License
http://creativecommons.org/licenses/by/4.0/
Language
English
Abstract
Human–AI collaboration (H + AI) using large language models (LLMs) offers a promising approach to enhance clinical reasoning, documentation, and interpretation tasks. Following PRISMA 2020 (PROSPERO registration: CRD420251068272), we systematically compared H + AI with human-only (H) workflows, searching four databases through June 28, 2025. Ten peer-reviewed studies met eligibility criteria, with three preprints informing sensitivity analyses only. Diagnostic/interpretation accuracy (k = 2) showed a positive trend for H + AI (Risk Ratio [RR] 1.59), but the estimate was imprecise and not statistically significant (95% CI 0.08 to 32.74), with the 95% prediction interval (PI) crossing the null. Composite diagnostic/management scores (k = 2) showed a statistically significant improvement (Mean Difference [MD] +4.88 percentage points, 95% CI +0.65 to +9.12), yet the PI (−31.65 to 41.42) indicates high real-world uncertainty. Time efficiency (k = 3) showed no overall difference (MD +0.4 min, 95% CI −4.18 to +4.97; I² = 70.1%). Documentation quality improved, but factual error rates remained high (~26–36%), undermining those quality gains. In three-arm settings, H + AI did not universally outperform AI-only. Evidence remains preliminary, highly uncertain, and context-dependent. We recommend preregistered, pragmatic, multicenter trials embedded in real workflows, with harmonized core outcomes that prioritize safety/error metrics and interfaces that surface uncertainty and support verification.
Citation
G. Wang, K. Zhang, J. Jiang, C. Wang, H. Bi, H. Liang, et al., "Human–large language model collaboration in clinical medicine: a systematic review and meta-analysis," npj Digital Medicine, vol. 9, no. 1, pp. 195-195, 2026, https://doi.org/10.1038/s41746-026-02382-2.
Source
npj Digital Medicine
Keywords
42 Health Sciences, 4203 Health Services and Systems, 3 Good Health and Well Being
Publisher
Springer Nature
