Search | VHL Regional Portal

Decoding Wilson disease: a machine learning approach to predict neurological symptoms.

Yang, Yulong; Wang, Gang-Ao; Fang, Shuzhen; Li, Xiang; Ding, Yufeng; Song, Yuqi; He, Wei; Rao, Zhihong; Diao, Ke; Zhu, Xiaolei; Yang, Wenming.

Front Neurol ; 15: 1418474, 2024.

Article in English | MEDLINE | ID: mdl-38966086

ABSTRACT

Objectives: Wilson disease (WD) is a rare autosomal recessive disorder caused by a mutation in the ATP7B gene. Neurological symptoms are one of the most common symptoms of WD. This study aims to construct a model that can predict the occurrence of neurological symptoms by combining clinical multidimensional indicators with machine learning methods. Methods: The study population consisted of WD patients who received treatment at the First Affiliated Hospital of Anhui University of Traditional Chinese Medicine from July 2021 to September 2023 and had a Leipzig score ≥ 4 points. Indicators such as general clinical information, imaging, blood and urine tests, and clinical scale measurements were collected from patients, and machine learning methods were employed to construct a prediction model for neurological symptoms. Additionally, the SHAP method was utilized to analyze clinical information to determine which indicators are associated with neurological symptoms. Results: In this study, 185 patients with WD (of whom 163 had neurological symptoms) were analyzed. It was found that using the eXtreme Gradient Boosting (XGB) to predict achieved good performance, with an MCC value of 0.556, ACC value of 0.929, AUROC value of 0.835, and AUPRC value of 0.975. Brainstem damage, blood creatinine (Cr), age, indirect bilirubin (IBIL), and ceruloplasmin (CP) were the top five important predictors. Meanwhile, the presence of brainstem damage and the higher the values of Cr, Age, and IBIL, the more likely neurological symptoms were to occur, while the lower the CP value, the more likely neurological symptoms were to occur. Conclusions: To sum up, the prediction model constructed using machine learning methods to predict WD cirrhosis has high accuracy. The most important indicators in the prediction model were brainstem damage, Cr, age, IBIL, and CP. It provides assistance for clinical decision-making.

MSTL-Kace: Prediction of Prokaryotic Lysine Acetylation Sites Based on Multistage Transfer Learning Strategy.

Wang, Gang-Ao; Yan, Xiaodi; Li, Xiang; Liu, Yinbo; Xia, Junfeng; Zhu, Xiaolei.

ACS Omega ; 8(44): 41930-41942, 2023 Nov 07.

Article in English | MEDLINE | ID: mdl-37969991

ABSTRACT

As one of the most important post-translational modifications (PTM), lysine acetylation (Kace) plays an important role in various biological activities. Traditional experimental methods for identifying Kace sites are inefficient and expensive. Instead, several machine learning methods have been developed for Kace site prediction, and hand-crafted features have been used to encode the protein sequences. However, there are still two challenges: the complex biological information may be under-represented by these manmade features and the small sample issue of some species needs to be addressed. We propose a novel model, MSTL-Kace, which was developed based on transfer learning strategy with pretrained bidirectional encoder representations from transformers (BERT) model. In this model, the high-level embeddings were extracted from species-specific BERT models, and a two-stage fine-tuning strategy was used to deal with small sample issue. Specifically, a domain-specific BERT model was pretrained using all of the sequences in our data sets, which was then fine-tuned, or two-stage fine-tuned based on the training data set of each species to obtain the species-specific BERT models. Afterward, the embeddings of residues were extracted from the fine-tuned model and fed to the different downstream learning algorithms. After comparison, the best model for the six prokaryotic species was built by using a random forest. The results for the independent test sets show that our model outperforms the state-of-the-art methods on all six species. The source codes and data for MSTL-Kace are available at https://github.com/leo97king/MSTL-Kace.

Protein-DNA interface hotspots prediction based on fusion features of embeddings of protein language model and handcrafted features.

Li, Xiang; Wang, Gang-Ao; Wei, Zhuoyu; Wang, Hong; Zhu, Xiaolei.

Comput Biol Chem ; 107: 107970, 2023 Dec.

Article in English | MEDLINE | ID: mdl-37866116

ABSTRACT

The identification of hotspot residues at the protein-DNA binding interfaces plays a crucial role in various aspects such as drug discovery and disease treatment. Although experimental methods such as alanine scanning mutagenesis have been developed to determine the hotspot residues on protein-DNA interfaces, they are both inefficient and costly. Therefore, it is highly necessary to develop efficient and accurate computational methods for predicting hotspot residues. Several computational methods have been developed, however, they are mainly based on hand-crafted features which may not be able to represent all the information of proteins. In this regard, we propose a model called PDH-EH, which utilizes fused features of embeddings extracted from a protein language model (PLM) and handcrafted features. After we extracted the total 1141 dimensional features, we used mRMR to select the optimal feature subset. Based on the optimal feature subset, several different learning algorithms such as Random Forest, Support Vector Machine, and XGBoost were used to build the models. The cross-validation results on the training dataset show that the model built by using Random Forest achieves the highest AUROC. Further evaluation on the independent test set shows that our model outperforms the existing state-of-the-art models. Moreover, the effectiveness and interpretability of embeddings extracted from PLM were demonstrated in our analysis. The codes and datasets used in this study are available at: https://github.com/lixiangli01/PDH-EH.

Subject(s)

Algorithms , Proteins , Databases, Protein , Proteins/chemistry , Protein Binding , DNA/chemistry

BERT-Kgly: A Bidirectional Encoder Representations From Transformers (BERT)-Based Model for Predicting Lysine Glycation Site for Homo sapiens.

Liu, Yinbo; Liu, Yufeng; Wang, Gang-Ao; Cheng, Yinchu; Bi, Shoudong; Zhu, Xiaolei.

Front Bioinform ; 2: 834153, 2022.

Article in English | MEDLINE | ID: mdl-36304324

ABSTRACT

As one of the most important posttranslational modifications (PTMs), protein lysine glycation changes the characteristics of the proteins and leads to the dysfunction of the proteins, which may cause diseases. Accurately detecting the glycation sites is of great benefit for understanding the biological function and potential mechanism of glycation in the treatment of diseases. However, experimental methods are expensive and time-consuming for lysine glycation site identification. Instead, computational methods, with their higher efficiency and lower cost, could be an important supplement to the experimental methods. In this study, we proposed a novel predictor, BERT-Kgly, for protein lysine glycation site prediction, which was developed by extracting embedding features of protein segments from pretrained Bidirectional Encoder Representations from Transformers (BERT) models. Three pretrained BERT models were explored to get the embeddings with optimal representability, and three downstream deep networks were employed to build our models. Our results showed that the model based on embeddings extracted from the BERT model pretrained on 556,603 protein sequences of UniProt outperforms other models. In addition, an independent test set was used to evaluate and compare our model with other existing methods, which indicated that our model was superior to other existing models.

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL