
A Bayesian model for identifying cancer subtypes from paired methylation profiles.

Fan, Yetian; S Chan, April; Zhu, Jun; Yi Leung, Suet; Fan, Xiaodan.

Brief Bioinform ; 24(1), 2023 01 19.

Article in English | MEDLINE | ID: mdl-36575828

ABSTRACT

Aberrant DNA methylation is the most common molecular lesion that is crucial for the occurrence and development of cancer, but has thus far been underappreciated as a clinical tool for cancer classification, diagnosis or as a guide for therapeutic decisions. Partly, this has been due to a lack of proven algorithms that can use methylation data to stratify patients into clinically relevant risk groups and subtypes that are of prognostic importance. Here, we proposed a novel Bayesian model to capture the methylation signatures of different subtypes from paired normal and tumor methylation array data. Application of our model to synthetic and empirical data showed high clustering accuracy, and was able to identify the possible epigenetic cause of a cancer subtype.

Subject(s)

DNA Methylation , Neoplasms , Humans , Bayes Theorem , Neoplasms/genetics

HLPI-Ensemble: Prediction of human lncRNA-protein interactions based on ensemble strategy.

Hu, Huan; Zhang, Li; Ai, Haixin; Zhang, Hui; Fan, Yetian; Zhao, Qi; Liu, Hongsheng.

RNA Biol ; 15(6): 797-806, 2018.

Article in English | MEDLINE | ID: mdl-29583068

ABSTRACT

LncRNAs play an important role in many biological processes and in disease progression by binding to related proteins. However, the experimental methods for studying lncRNA-protein interactions are time-consuming and expensive. Although a few models have been designed to predict ncRNA-protein interactions, they share some common drawbacks that limit their predictive performance. In this study, we present a model called HLPI-Ensemble designed specifically for human lncRNA-protein interactions. HLPI-Ensemble adopts an ensemble strategy based on three mainstream machine learning algorithms, Support Vector Machines (SVM), Random Forests (RF) and Extreme Gradient Boosting (XGB), to generate HLPI-SVM Ensemble, HLPI-RF Ensemble and HLPI-XGB Ensemble, respectively. The results of 10-fold cross-validation show that HLPI-SVM Ensemble, HLPI-RF Ensemble and HLPI-XGB Ensemble achieved AUCs of 0.95, 0.96 and 0.96, respectively, in the test dataset. Furthermore, we compared the performance of the HLPI-Ensemble models with that of previous models on an external validation dataset. The results show that the HLPI-Ensemble models produce far fewer false positives (FPs) than the previous models, and that their other evaluation indicators are also higher. These results further show that the HLPI-Ensemble models are superior to previous models in predicting human lncRNA-protein interactions. HLPI-Ensemble is publicly available at: http://ccsipb.lnu.edu.cn/hlpiensemble/ .
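As a rough illustration of how an ensemble's AUC can be computed, the sketch below averages the scores of several base models and evaluates the result with the pairwise (rank-based) definition of AUC. The averaging rule and the function names `average_scores` and `auc` are assumptions made for illustration; the abstract does not state how HLPI-Ensemble combines its base models.

```python
def average_scores(model_scores):
    """Average per-sample scores across base models (assumed ensemble rule)."""
    return [sum(column) / len(column) for column in zip(*model_scores)]


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

On a toy example, a single base model that misranks one pair can be outscored by the averaged ensemble, which is the general motivation for ensembling.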

Subject(s)

Databases, Nucleic Acid , Models, Biological , RNA, Long Noncoding , RNA-Binding Proteins , Sequence Analysis, RNA/methods , Support Vector Machine , Humans , RNA, Long Noncoding/genetics , RNA, Long Noncoding/metabolism , RNA-Binding Proteins/genetics , RNA-Binding Proteins/metabolism

Prediction of essential proteins based on subcellular localization and gene expression correlation.

Fan, Yetian; Tang, Xiwei; Hu, Xiaohua; Wu, Wei; Ping, Qing.

BMC Bioinformatics ; 18(Suppl 13): 470, 2017 Dec 01.

Article in English | MEDLINE | ID: mdl-29219067

ABSTRACT

BACKGROUND: Essential proteins are indispensable to the survival and development of living organisms. To understand the functional mechanisms of essential proteins, which can be applied to the analysis of disease and the design of drugs, it is important to first identify the essential proteins within a set of proteins. As traditional experimental methods for identifying essential proteins are usually expensive and laborious, computational methods, which utilize biological and topological features of proteins, have attracted more attention in recent years. Protein-protein interaction networks, together with other biological data, have been explored to improve the performance of essential protein prediction. RESULTS: The proposed method SCP is evaluated on Saccharomyces cerevisiae datasets and compared with five other methods. The results show that SCP outperforms the other five methods in terms of accuracy of essential protein prediction. CONCLUSIONS: In this paper, we propose a novel algorithm named SCP, which combines the ranking produced by a modified PageRank algorithm based on subcellular compartment information with the ranking produced by the Pearson correlation coefficient (PCC) calculated from gene expression data. Experiments show that subcellular localization information is promising for boosting essential protein prediction.
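The combination described in the conclusions can be sketched roughly as follows: a weighted PageRank over the interaction network plus a Pearson-correlation term from expression profiles. This is only a minimal sketch under stated assumptions, not the SCP algorithm itself. The edge weights are assumed to already encode compartment information, and the linear mix controlled by the hypothetical parameter `alpha` is an illustrative choice; the abstract specifies neither.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def weighted_pagerank(weights, damping=0.85, iterations=100):
    """Weighted PageRank on a symmetric graph {node: {neighbor: weight}}.

    Nodes whose edge weights sum to zero pass on no rank (a simplification).
    """
    nodes = sorted(weights)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        updated = {}
        for node in nodes:
            incoming = 0.0
            for neighbor, w in weights[node].items():
                total = sum(weights[neighbor].values())
                if total > 0:
                    incoming += rank[neighbor] * w / total
            updated[node] = (1 - damping) / len(nodes) + damping * incoming
        rank = updated
    return rank


def combined_scores(weights, expression, alpha=0.5):
    """Mix normalized PageRank with mean PCC to neighbors (assumed rule)."""
    rank = weighted_pagerank(weights)
    top = max(rank.values())
    scores = {}
    for node, neighbors in weights.items():
        pccs = [pearson(expression[node], expression[m]) for m in neighbors]
        mean_pcc = sum(pccs) / len(pccs) if pccs else 0.0
        scores[node] = alpha * rank[node] / top + (1 - alpha) * mean_pcc
    return scores
```

Proteins would then be ranked by descending combined score; a protein that is both central in the network and co-expressed with its neighbors scores highest under this assumed rule.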

Subject(s)

Algorithms , Computational Biology/methods , Gene Expression Regulation, Fungal , Genes, Essential , Protein Interaction Maps , Saccharomyces cerevisiae Proteins/metabolism , Saccharomyces cerevisiae/metabolism , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae Proteins/genetics , Subcellular Fractions

Predicting diabetes mellitus genes via protein-protein interaction and protein subcellular localization information.

Tang, Xiwei; Hu, Xiaohua; Yang, Xuejun; Fan, Yetian; Li, Yongfan; Hu, Wei; Liao, Yongzhong; Zheng, Ming Cai; Peng, Wei; Gao, Li.

BMC Genomics ; 17 Suppl 4: 433, 2016 08 18.

Article in English | MEDLINE | ID: mdl-27535125

ABSTRACT

BACKGROUND: Diabetes mellitus, characterized by hyperglycemia resulting from insufficient production of or reduced sensitivity to insulin, poses a growing threat to human health. It is a heterogeneous disorder with multiple etiologies, including type 1 diabetes, type 2 diabetes and gestational diabetes. Predicting diabetes-associated proteins/genes is a key step toward understanding the cellular mechanisms related to diabetes mellitus. Compared with experimental methods, computational prediction of candidate proteins/genes is cheaper and less laborious. Protein-protein interaction (PPI) data produced by high-throughput technology have been used to prioritize candidate disease genes/proteins. However, false interactions in the PPI data seriously degrade the performance of computational methods. To address this problem, new methods have been developed that identify candidate disease genes/proteins by integrating biological data from other sources. RESULTS: In this study, a new framework called PDMG is proposed to predict candidate disease genes/proteins. First, weighted networks are built by combining subcellular localization information with PPI data. To form the weighted networks, the importance of each compartment is evaluated based on the number of interacting proteins in that compartment, because different compartments play very different roles in cell activities and some are more important than others. Based on the evaluated compartments, the interactions between proteins are scored and the weighted PPI networks are constructed. Second, known disease genes are extracted from the OMIM database and used as seed genes to expand disease-specific networks based on the weighted networks. Third, the weights between a protein and its neighbors in the disease-related networks are summed, and the sum is taken as the score of the protein. Finally, the proteins are ranked in descending order of their scores. The top-ranked candidate proteins are considered to be associated with the disease and are potential disease-related proteins. Various types of data, such as type 2 diabetes-associated genes, subcellular localizations and protein interactions, are used to test the PDMG method. CONCLUSIONS: The results show that the proteins/genes that functionally exert a direct influence on diabetes are consistently ranked at the top. PDMG expands and ranks 445 candidate proteins from a seed set that includes the original 27 type 2 diabetes proteins. Of the top 27 proteins, 14 are known type 2 diabetes proteins. Literature retrieved from the PubMed database supports the association of 8 of the 13 novel proteins with diabetes.
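The pipeline described in the results (compartment importance, weighted interactions, seed expansion, neighbor-weight scoring, ranking) can be sketched as below. This is a minimal sketch under stated assumptions rather than the PDMG implementation: the normalization of compartment importance, the use of the most important shared compartment as the edge weight, and the expansion to direct neighbors of the seeds are all illustrative choices that the abstract does not specify, and every function name is hypothetical.

```python
from collections import defaultdict


def compartment_importance(edges, localization):
    """Importance = number of interacting proteins in a compartment,
    scaled by the largest count (the scaling is an assumption)."""
    interacting = {protein for edge in edges for protein in edge}
    counts = defaultdict(int)
    for protein in interacting:
        for compartment in localization.get(protein, ()):
            counts[compartment] += 1
    top = max(counts.values(), default=1)
    return {compartment: n / top for compartment, n in counts.items()}


def weight_edges(edges, localization, importance):
    """Score each interaction by its most important shared compartment;
    interactions with no shared compartment get weight 0 (assumed rule)."""
    weights = {}
    for a, b in edges:
        shared = set(localization.get(a, ())) & set(localization.get(b, ()))
        weights[frozenset((a, b))] = max(
            (importance[c] for c in shared), default=0.0
        )
    return weights


def rank_candidates(edges, localization, seeds):
    """Expand seeds to direct neighbors, sum edge weights per protein,
    and return (protein, score) pairs in descending order of score."""
    importance = compartment_importance(edges, localization)
    weights = weight_edges(edges, localization, importance)
    network = set(seeds)
    for a, b in edges:
        if a in seeds or b in seeds:
            network.update((a, b))
    scores = defaultdict(float)
    for a, b in edges:
        if a in network and b in network:
            w = weights[frozenset((a, b))]
            scores[a] += w
            scores[b] += w
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Scoring an interaction by shared compartments is one way to down-weight likely false interactions, since two proteins that never co-localize are less likely to interact in vivo.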

Subject(s)

Computational Biology/methods , Diabetes Mellitus, Type 2/genetics , Protein Interaction Mapping/methods , Protein Interaction Maps/genetics , Algorithms , Humans , Proteins/genetics , Proteins/metabolism , Software

An Algorithm for Motif Discovery with Iteration on Lengths of Motifs.

Fan, Yetian; Wu, Wei; Yang, Jie; Yang, Wenyu; Liu, Rongrong.

IEEE/ACM Trans Comput Biol Bioinform ; 12(1): 136-41, 2015.

Article in English | MEDLINE | ID: mdl-26357084

ABSTRACT

Analysis of DNA sequence motifs is becoming increasingly important in the study of gene regulation, and the identification of motifs in DNA sequences is a complex problem in computational biology. Motif discovery has attracted the attention of more and more researchers, and a variety of algorithms have been proposed. Most existing motif discovery algorithms fix the motif's length as one of the input parameters. In this paper, a novel method is proposed to identify the optimal length of the motif and the optimal motif with that length, through an iteration process over increasing lengths. For each fixed length, a modified genetic algorithm (GA) is used to find the optimal motif with that length. Three operators are used in the modified GA: Mutation, which is similar to the one used in the usual GA but is modified to avoid local optima in our case, and Addition and Deletion, which are proposed by us for this problem. A criterion is given for singling out the optimal length among the increasing motif lengths. We call this method AMDILM (an algorithm for motif discovery with iteration on lengths of motifs). The experiments on simulated data and real biological data show that AMDILM can accurately identify the optimal motif length. Meanwhile, the optimal motifs discovered by AMDILM are consistent with the real ones and are similar to the motifs obtained by three well-known methods: Gibbs Sampler, MEME and Weeder.

Subject(s)

Algorithms , Computational Biology/methods , Nucleotide Motifs/genetics , Sequence Analysis, DNA/methods , Mutation/genetics
