XGBoost-enhanced ensemble model using discriminative hybrid features for the prediction of sumoylation sites
Khan, Salman ; Noor, Sumaiya ; Javed, Tahir ; Naseem, Afshan ; Aslam, Fahad ; AlQahtani, Salman A. ; Ahmad, Nijad
Department
Computer Vision
Type
Journal article
Date
2025
Language
English
Abstract
Posttranslational modifications (PTMs) are essential for regulating protein localization and stability, and they significantly affect gene expression, biological functions, and genome replication. Among these, sumoylation, a PTM that attaches a chemical group to protein sequences, plays a critical role in protein function. Identifying sumoylation sites is particularly important because of their links to Parkinson’s and Alzheimer’s disease. This study introduces XGBoost-Sumo, a model that predicts sumoylation sites by integrating protein structure and sequence data. The model uses a transformer-based attention mechanism to encode peptides and extracts evolutionary features through the PsePSSM-DWT approach. It fuses the word embeddings with the evolutionary descriptors, applies the SHapley Additive exPlanations (SHAP) algorithm for feature selection, and uses eXtreme Gradient Boosting (XGBoost) for classification. XGBoost-Sumo achieved an accuracy of 99.68% on benchmark datasets under 10-fold cross-validation and 96.08% on independent samples, outperforming existing models by 10.31% on training data and 2.74% on independent tests. The model’s reliability and high performance make it a valuable resource for researchers, with potential applications in pharmaceutical development.
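To make the evolutionary descriptor mentioned in the abstract more concrete, the following is a minimal sketch of the commonly used PsePSSM formulation: per-column means of a position-specific scoring matrix followed by lagged squared-difference terms. It is not the authors' implementation. The function name `pse_pssm`, the `max_lag` parameter, and the toy matrix are assumptions made for illustration, and the normalization and discrete wavelet transform (DWT) steps of PsePSSM-DWT are omitted.

```python
from typing import List


def pse_pssm(pssm: List[List[float]], max_lag: int = 2) -> List[float]:
    """Collapse an L x 20 PSSM into a vector of 20 * (1 + max_lag) values.

    Hypothetical helper for illustration; a real pipeline would normally
    normalize the PSSM first.
    """
    length = len(pssm)
    if length <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    n_cols = len(pssm[0])

    # Column means: average score of each amino-acid column over all positions.
    features = [sum(row[j] for row in pssm) / length for j in range(n_cols)]

    # Lag terms: mean squared difference between rows `lag` positions apart,
    # computed separately for each column.
    for lag in range(1, max_lag + 1):
        for j in range(n_cols):
            total = sum(
                (pssm[i][j] - pssm[i + lag][j]) ** 2
                for i in range(length - lag)
            )
            features.append(total / (length - lag))
    return features


# Toy 4 x 20 matrix with made-up scores (not real PSSM output).
toy = [[float(i + j) for j in range(20)] for i in range(4)]
vec = pse_pssm(toy, max_lag=2)
print(len(vec))
```

The resulting fixed-length vector is what makes peptides of different lengths comparable; in a pipeline like the one described, such a vector would be concatenated with the embedding features before feature selection and classification.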
Citation
S. Khan et al., “XGBoost-enhanced ensemble model using discriminative hybrid features for the prediction of sumoylation sites,” BioData Min., vol. 18, no. 1, pp. 1–18, Dec. 2025, doi: 10.1186/S13040-024-00415-8.
Source
BioData Mining
Keywords
Pseudo position-specific score matrix, Sumoylation, Post-translation modification, XGBoost, SHAP
Publisher
Springer Nature
