xTrimoPGLM: unified 100-billion-parameter pretrained transformer for deciphering the language of proteins
Chen, Bo; Cheng, Xingyi; Li, Pan; Geng, Yangliao; Gong, Jing; Li, Shen; Bei, Zhilei; Tan, Xu; Wang, Boyan; Zeng, Xin; et al. (5 additional authors not shown)
Department
Machine Learning
Type
Journal article
Date
2025
Language
English
Abstract
Protein language models have shown remarkable success in learning biological information from protein sequences. However, most existing models are limited by either autoencoding or autoregressive pretraining objectives, which makes them struggle to handle protein understanding and generation tasks concurrently. We propose a unified protein language model, xTrimoPGLM, that addresses both types of tasks simultaneously through an innovative pretraining framework. Our key technical contribution is an exploration of the compatibility of the two objectives and their potential for joint optimization, which led to a strategy for training xTrimoPGLM at an unprecedented scale of 100 billion parameters and 1 trillion training tokens. Our extensive experiments reveal that (1) xTrimoPGLM substantially outperforms other advanced baselines on 18 protein understanding benchmarks across four categories, and it also facilitates an atomic-resolution view of protein structures, leading to an advanced three-dimensional structure prediction model that surpasses existing language-model-based tools; and (2) xTrimoPGLM can not only generate de novo protein sequences that follow the principles of natural ones but also perform programmable generation after supervised fine-tuning on curated sequences. These results highlight the substantial capability and versatility of xTrimoPGLM in understanding and generating protein sequences, contributing to the evolving landscape of foundation models in protein science. Trained weights for the xTrimoPGLM model and downstream datasets are available at https://huggingface.co/biomap-research.
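The abstract's central idea, jointly optimizing an autoencoding (understanding) objective and an autoregressive (generation) objective on a single backbone, can be illustrated with a minimal sketch. This is not the authors' training code: the model interface, the batch fields, and the fixed mixing weight are assumptions made for illustration only; the paper's actual framework is GLM-based and trained at 100-billion-parameter scale.

```python
# Minimal sketch (not the authors' code): combining a BERT-style masked
# loss with a GPT-style causal loss on one shared backbone, in the spirit
# of the unified pretraining the abstract describes. All module, argument,
# and field names here are illustrative assumptions.
import torch
import torch.nn.functional as F

def joint_pretraining_loss(model, mlm_batch, ar_batch, mlm_weight=0.5):
    """Jointly optimize understanding and generation objectives.

    mlm_batch: dict with "input_ids" and "labels"; labels are -100
               everywhere except at masked positions.
    ar_batch:  dict with "input_ids" and "labels"; labels are the
               inputs shifted left by one position.
    """
    # Understanding objective: predict masked tokens from bidirectional context.
    mlm_logits = model(mlm_batch["input_ids"], causal=False)
    mlm_loss = F.cross_entropy(
        mlm_logits.view(-1, mlm_logits.size(-1)),
        mlm_batch["labels"].view(-1),
        ignore_index=-100,
    )

    # Generation objective: predict each token from its left context only.
    ar_logits = model(ar_batch["input_ids"], causal=True)
    ar_loss = F.cross_entropy(
        ar_logits.view(-1, ar_logits.size(-1)),
        ar_batch["labels"].view(-1),
        ignore_index=-100,
    )

    # A fixed mixing weight stands in for whatever objective schedule
    # the paper actually uses.
    return mlm_weight * mlm_loss + (1.0 - mlm_weight) * ar_loss
```

In practice one would load the released weights from https://huggingface.co/biomap-research (for example via the Hugging Face transformers library) rather than pretrain from scratch; the exact checkpoint names are not given in this record, so none are assumed here.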
Citation
B. Chen et al., "xTrimoPGLM: unified 100-billion-parameter pretrained transformer for deciphering the language of proteins," Nature Methods, pp. 1–12, Apr. 2025, doi: 10.1038/s41592-025-02636-z.
Source
Nature Methods
Keywords
xTrimoPGLM, Protein language model, 100-billion parameters, Protein sequence generation
Publisher
Springer Nature
