Unifying Protein Function Prediction via Text Matching
Li, Xun
Author
Supervisor
Department
Machine Learning
Embargo End Date
2024-01-01
Type
Thesis
Date
2024
License
Language
English
Abstract
The rapid growth of publicly available protein sequences has popularized the pretrain-then-finetune paradigm for predicting protein functions. This approach, however, depends on annotated protein data specific to each prediction task and requires extensive finetuning for each protein function, which is a significant limitation. To overcome these challenges, this thesis proposes a novel method termed Protein prediction via Text Matching (ProTeM), which seeks to simplify and unify the protein function prediction process. The method converts numeric or categorical labels from various protein function datasets into textual instructions that carry rich semantic information. It uses both Large Language Models (LLMs) and Protein Language Models (PLMs): the LLMs are employed for their superior language understanding, which lets them discern connections among protein functions and thereby facilitates a more effective alignment between textual instructions and protein sequences.
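The label-to-text conversion and matching step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the instruction template, the task and label names, and the two-dimensional toy vectors are all assumptions made for illustration, and the vectors stand in for the outputs of a PLM (protein side) and an LLM (text side) that the sketch does not load.

```python
import math
from typing import Dict, Sequence


def label_to_instruction(task: str, label: str) -> str:
    """Turn a categorical label into a textual instruction.

    The template is hypothetical; the abstract does not specify one.
    """
    return f"For the task '{task}', this protein is annotated as: {label}."


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict(protein_vec: Sequence[float],
            instruction_vecs: Dict[str, Sequence[float]]) -> str:
    """Return the instruction whose embedding best matches the protein embedding."""
    return max(instruction_vecs,
               key=lambda text: cosine(protein_vec, instruction_vecs[text]))


# Illustrative task and labels (assumed, not taken from the thesis).
labels = ["membrane", "soluble"]
instructions = [label_to_instruction("subcellular localization", name)
                for name in labels]

# Toy stand-ins for encoder outputs: in practice these would come from an LLM
# (instruction embeddings) and a PLM (protein embedding) in a shared space.
toy_text_vecs = {instructions[0]: [1.0, 0.0], instructions[1]: [0.0, 1.0]}
toy_protein_vec = [0.9, 0.1]

best = predict(toy_protein_vec, toy_text_vecs)
print(best)
```

Framing prediction as choosing the best-matching instruction lets one scoring function serve datasets with different label types, which is the unification the abstract refers to; how ProTeM actually trains and aligns the two encoders is not covered by this sketch.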
Citation
X. Li, "Unifying Protein Function Prediction via Text Matching," M.S. thesis, Machine Learning, MBZUAI, Abu Dhabi, UAE, 2024.
