MSAGPT: Neural Prompting Protein Structure Prediction via MSA Generative Pre-Training
Chen, Bo ; Bei, Zhilei ; Cheng, Xingyi ; Li, Pan ; Tang, Jie ; Song, Le
Department
Machine Learning
Type
Conference proceeding
Date
2024
Language
English
Abstract
Multiple Sequence Alignment (MSA) plays a pivotal role in unveiling the evolutionary trajectories of protein families. The accuracy of protein structure prediction is often compromised for protein sequences that lack sufficient homologous information to construct a high-quality MSA. Although various methods have been proposed to generate high-quality MSAs under these conditions, they either fall short of comprehensively capturing the intricate co-evolutionary patterns within an MSA or require guidance from external oracle models. Here we introduce MSAGPT, a novel approach to prompting protein structure prediction via MSA generative pre-training in the low-MSA regime. MSAGPT employs a simple yet effective 2D evolutionary positional encoding scheme to model complex evolutionary patterns. Building on this encoding, a flexible 1D MSA decoding framework supports zero- or few-shot learning. Moreover, we demonstrate that leveraging feedback from AlphaFold2 (AF2) can further enhance the model’s capacity via Rejective Fine-tuning (RFT) and Reinforcement Learning from AF2 Feedback (RLAF). Extensive experiments confirm the efficacy of MSAGPT in generating faithful and informative MSAs (up to +8.5% TM-Score in few-shot scenarios). Transfer learning experiments also demonstrate its potential for a wide range of tasks that depend on MSA quality.
Citation
B. Chen, Z. Bei, X. Cheng, P. Li, J. Tang, and L. Song, “MSAGPT: Neural Prompting Protein Structure Prediction via MSA Generative Pre-Training,” Adv Neural Inf Process Syst, vol. 37, pp. 37504–37534, Dec. 2024, Accessed: Mar. 24, 2025. [Online]. Available: https://github.com/THUDM/MSAGPT
Source
Advances in Neural Information Processing Systems (NeurIPS 2024)
Keywords
Protein structure prediction, Multiple Sequence Alignment (MSA), Low-MSA regime, Generative pre-training, Neural prompting
Publisher
NEURIPS
