
Fuzzy kernel evidence Random Forest for identifying pseudouridine sites.

Chen, Mingshuai; Sun, Mingai; Su, Xi; Tiwari, Prayag; Ding, Yijie.

Brief Bioinform ; 25(3)2024 Mar 27.

Article in English | MEDLINE | ID: mdl-38622357

ABSTRACT

Pseudouridine is an RNA modification that is widely distributed in both prokaryotes and eukaryotes, and plays a critical role in numerous biological activities. Despite its importance, the precise identification of pseudouridine sites through experimental approaches poses significant challenges, requiring substantial time and resources. Therefore, there is a growing need for computational techniques that can reliably and quickly identify pseudouridine sites from vast amounts of RNA sequencing data. In this study, we propose fuzzy kernel evidence Random Forest (FKeERF) to identify pseudouridine sites. The resulting method, called PseU-FKeERF, demonstrates high accuracy in identifying pseudouridine sites from RNA sequencing data. The PseU-FKeERF model selects four RNA feature coding schemes with relatively good performance, combines them, and inputs the combined features into the newly proposed FKeERF method for category prediction. FKeERF not only uses fuzzy logic to expand the original feature space, but also combines kernel methods that are easy to interpret in general for category prediction. Both cross-validation tests and independent tests on benchmark datasets have shown that PseU-FKeERF has better predictive performance than several state-of-the-art methods. This new method not only improves the accuracy of pseudouridine site identification, but also provides a reference for disease control and related drug development in the future.

Subject(s)

Pseudouridine , Random Forest , Pseudouridine/genetics , RNA/genetics , Base Sequence

Sequence-Based Prediction with Feature Representation Learning and Biological Function Analysis of Channel Proteins.

Chen, Zheng; Jiao, Shihu; Zhao, Da; Hesham, Abd El-Latif; Zou, Quan; Xu, Lei; Sun, Mingai; Zhang, Lijun.

Front Biosci (Landmark Ed) ; 27(6): 177, 2022 06 02.

Article in English | MEDLINE | ID: mdl-35748253

ABSTRACT

BACKGROUND: Channel proteins are proteins that can transport molecules past the plasma membrane through free diffusion movement. Because experimental identification is costly in labor and methods, a computational tool to identify channel proteins is necessary for biological research on them. METHODS: 17 feature coding methods and four machine learning classifiers were used to generate 68-dimensional probability features. Then, a two-step feature selection strategy was used to optimize the features, and the final prediction model M16-LGBM (light gradient boosting machine) was obtained on the 16-dimensional optimal feature vector. RESULTS: A new predictor, CAPs-LGBM, was proposed to identify channel proteins effectively. CONCLUSIONS: CAPs-LGBM is the first machine learning predictor for channel proteins whose final prediction model is constructed from protein primary sequences. The classifier performed well on the training and test sets.
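The 68-dimensional probability feature vector described in METHODS follows from pairing each of the 17 feature codings with each of the 4 classifiers (17 × 4 = 68). The sketch below only illustrates that layout: the toy models, the example sequence, and the choice of the first 16 indices are hypothetical placeholders, not the authors' encodings, classifiers, or two-step feature selection.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

N_ENCODINGS = 17    # feature coding methods (from the abstract)
N_CLASSIFIERS = 4   # machine learning classifiers (from the abstract)

Model = Callable[[str], float]


def make_toy_model(enc_id: int, clf_id: int) -> Model:
    """Hypothetical stand-in for one trained (encoding, classifier) pair."""
    def predict_proba(seq: str) -> float:
        # Deterministic dummy score in [0, 1]; a real model would encode
        # the sequence and return a predicted class probability.
        return ((len(seq) * (enc_id + 1) + clf_id) % 101) / 100.0
    return predict_proba


# One base model per (encoding, classifier) combination.
models: Dict[Tuple[int, int], Model] = {
    (e, c): make_toy_model(e, c)
    for e, c in product(range(N_ENCODINGS), range(N_CLASSIFIERS))
}


def probability_features(seq: str, base_models: Dict[Tuple[int, int], Model]) -> List[float]:
    """Collect one probability per base model, in a fixed key order."""
    return [base_models[key](seq) for key in sorted(base_models)]


features = probability_features("MKTLLVAGG", models)  # made-up sequence

# Placeholder for the two-step feature selection: keep 16 of the 68 features.
selected_idx = list(range(16))
reduced = [features[i] for i in selected_idx]
```

In a setup like this, the reduced 16-dimensional vector would be the input to the final gradient boosting model.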

Subject(s)

Computational Biology , Proteins , Algorithms , Amino Acid Sequence , Computational Biology/methods , Machine Learning , Support Vector Machine

Identification of Gingivitis-Related Genes Across Human Tissues Based on the Summary Mendelian Randomization.

Zhang, Jiahui; Sun, Mingai; Zhao, Yuanyuan; Geng, Guannan; Hu, Yang.

Front Cell Dev Biol ; 8: 624766, 2020.

Article in English | MEDLINE | ID: mdl-34026747

ABSTRACT

Periodontal diseases are among the most frequent inflammatory diseases in children and adolescents; they affect the supporting structures of the teeth, lead to tooth loss, and contribute to systemic inflammation. Gingivitis is the most common periodontal infection. It is mainly caused by a substance produced by microbial plaque, systemic disorders, and genetic abnormalities in the host. Identifying gingivitis-related genes across human tissues is significant not only for understanding disease mechanisms but also for disease development and clinical diagnosis. The genome-wide association study (GWAS) is a commonly used method to mine disease-related genetic variants. However, due to factors such as linkage disequilibrium, it is difficult for GWAS to identify genes directly related to the disease. Hence, we constructed a data integration method that uses summary Mendelian randomization (SMR) to combine GWAS with expression quantitative trait locus (eQTL) data to identify gingivitis-related genes. Five eQTL studies from different human tissues and one GWAS study were referenced in this paper. This study identified several candidate SNPs and genes related to gingivitis in a tissue-specific or cross-tissue manner. Further, we also analyzed and explained the functions of these genes. The R program for the SMR method has been uploaded to GitHub (https://github.com/hxdde/SMR).
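As a rough illustration of how SMR combines GWAS and eQTL summary statistics for a single SNP, the sketch below computes the standard SMR effect estimate (GWAS effect divided by eQTL effect) and the approximate one-degree-of-freedom chi-square test statistic. It is a minimal Python sketch, not the authors' R program; the function name and the input values are assumptions made for illustration.

```python
import math
from typing import Tuple


def smr_test(b_gwas: float, se_gwas: float,
             b_eqtl: float, se_eqtl: float) -> Tuple[float, float, float]:
    """Return (effect estimate, test statistic, p-value) for one SNP.

    b_gwas, se_gwas: SNP effect on the trait and its standard error.
    b_eqtl, se_eqtl: SNP effect on gene expression and its standard error.
    """
    z_gwas = b_gwas / se_gwas
    z_eqtl = b_eqtl / se_eqtl

    # Estimated effect of gene expression on the trait.
    b_xy = b_gwas / b_eqtl

    # Approximate chi-square statistic with 1 degree of freedom.
    t_smr = (z_gwas ** 2 * z_eqtl ** 2) / (z_gwas ** 2 + z_eqtl ** 2)

    # Upper-tail probability of a chi-square(1) variable.
    p_value = math.erfc(math.sqrt(t_smr / 2.0))
    return b_xy, t_smr, p_value


# Made-up summary statistics, for illustration only.
b_xy, t_smr, p_value = smr_test(b_gwas=0.1, se_gwas=0.02,
                                b_eqtl=0.5, se_eqtl=0.05)
```

A full analysis would repeat this per gene and tissue, using the top cis-eQTL SNP for each gene.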
