1.
Bioinformatics ; 34(3): 398-406, 2018 02 01.
Article in English | MEDLINE | ID: mdl-29028927

ABSTRACT

Motivation: Clearly identifying the primary site of a tumor is of great importance for subsequent site-specific targeted treatment and could improve patients' overall survival. Even though many classifiers based on gene expression have been proposed to predict the tumor primary site, only a few studies focus on using DNA methylation (DNAm) profiles to develop classifiers, and none of them compares the performance of classifiers based on different profiles. Results: We introduced novel selection strategies to identify highly tissue-specific CpG sites and then used the random forest approach to construct classifiers that predict the origin of tumors. We also compared the prediction performance by applying a similar strategy to miRNA expression profiles. Our analysis indicated that these classifiers had an accuracy of 96.05% (Maximum-Relevance-Maximum-Distance: 90.02-99.99%) or 95.31% (principal component analysis: 79.82-99.91%) on independent DNAm datasets, and an overall accuracy of 91.30% (range 79.33-98.74%) on independent miRNA test sets for predicting tumor origin. This suggests that our feature selection methods are very effective at identifying tissue-specific biomarkers and that the classifiers we developed can efficiently predict the origin of tumors. We also developed a user-friendly webserver that helps users predict the tumor origin by uploading miRNA expression or DNAm profiles of interest. Availability and implementation: The webserver and the related data and code are accessible at http://server.malab.cn/MMCOP/. Contact: zouquan@nclab.net or a.teschendorff@ucl.ac.uk. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Computational Biology/methods , DNA Methylation , Genes, Neoplasm , MicroRNAs/genetics , Neoplasms/diagnosis , CpG Islands , DNA, Neoplasm , Female , Gene Expression Profiling/methods , Gene Expression Regulation, Neoplastic , Humans , Male , Neoplasms/genetics , Sequence Analysis, DNA/methods , Sequence Analysis, RNA/methods
2.
Algorithms Mol Biol ; 12: 25, 2017.
Article in English | MEDLINE | ID: mdl-29026435

ABSTRACT

BACKGROUND: Multiple sequence alignment (MSA) plays a key role in biological sequence analyses, especially in phylogenetic tree construction. The rapid growth of next-generation sequencing data has exposed a shortage of efficient ultra-large biological sequence alignment approaches that can cope with different sequence types. METHODS: Distributed and parallel computing is a crucial technique for accelerating ultra-large (e.g. files larger than 1 GB) sequence analyses. Based on HAlign and the Spark distributed computing system, we implement a highly cost-efficient and time-efficient tool, HAlign-II, to address ultra-large multiple biological sequence alignment and phylogenetic tree construction. RESULTS: Experiments on large-scale DNA and protein data sets, with files larger than 1 GB, showed that HAlign-II could save time and space. It outperformed the current software tools. HAlign-II can efficiently carry out MSA and construct phylogenetic trees with ultra-large numbers of biological sequences. HAlign-II shows extremely high memory efficiency and scales well with increases in computing resources. CONCLUSIONS: HAlign-II provides a user-friendly web server based on our distributed computing infrastructure. HAlign-II, with open-source code and datasets, is available at http://lab.malab.cn/soft/halign.

3.
Proteomics ; 17(17-18)2017 Sep.
Article in English | MEDLINE | ID: mdl-28776938

ABSTRACT

Predicting the subcellular localization of proteins is an important and challenging problem. Traditional experimental approaches are often expensive and time-consuming. Consequently, a growing number of research efforts employ machine learning approaches to predict the subcellular location of proteins. There are two main challenges among the state-of-the-art prediction methods. First, most of the existing techniques are designed to deal with multi-class rather than multi-label classification, which ignores connections between multiple labels. In reality, the fact that particular proteins occupy multiple locations carries vital and unique biological significance that deserves special focus and cannot be ignored. Second, techniques for handling imbalanced data in multi-label classification problems are necessary but have never been employed. To address these two issues, we have developed an ensemble multi-label classifier called HPSLPred, which can be applied to multi-label classification with an imbalanced protein source. For convenience, a user-friendly webserver has been established at http://server.malab.cn/HPSLPred.
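
The difference between multi-class and multi-label targets described in this abstract can be illustrated with a small sketch. It is not taken from HPSLPred: the function name `to_indicator_matrix` and the location names in the usage note are hypothetical, and it only shows how a set of locations per protein can be encoded as one binary indicator per location.

```python
def to_indicator_matrix(label_sets, classes=None):
    """Encode per-protein label sets as rows of 0/1 indicators.

    label_sets: list of sets, one set of location names per protein.
    classes:    optional fixed column order; defaults to the sorted
                union of all labels seen (an assumption of this sketch).
    Returns (classes, rows).
    """
    if classes is None:
        classes = sorted(set().union(*label_sets))
    rows = [[1 if c in labels else 0 for c in classes] for labels in label_sets]
    return classes, rows
```

For example, `to_indicator_matrix([{"nucleus", "cytoplasm"}, {"nucleus"}])` yields the columns `["cytoplasm", "nucleus"]` and the rows `[[1, 1], [0, 1]]`. A multi-class encoding would have to force the first protein into a single location.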


Subject(s)
Computational Biology/methods , Machine Learning , Proteins/classification , Proteins/metabolism , Databases, Protein , Humans , Intracellular Space , Protein Transport , Subcellular Fractions
4.
Artif Intell Med ; 83: 82-90, 2017 Nov.
Article in English | MEDLINE | ID: mdl-28245947

ABSTRACT

Selective ensemble learning is a technique that selects a subset of diverse and accurate basic models in order to achieve stronger generalization ability. In this paper, we proposed a novel learning algorithm that is based on parallel optimization and hierarchical selection (PTHS). Our novel feature selection method is based on maximizing the sum of relevance and distance (MSRD) to solve the problem of high dimensionality. Specifically, the PTHS algorithm employs parallel optimization and candidate model pruning based on k-means and a hierarchical selection framework. We combine the prediction results of the basic models by majority voting, which employs the divide-and-conquer strategy to save computing time. In addition, the PT algorithm is capable of transforming a multi-class problem into a binary classification problem, thereby allowing our ensemble model to address multi-class problems. An empirical study shows that MSRD is efficient in solving the high dimensionality problem, and PTHS exhibits better performance than other existing classification algorithms. Most importantly, our classifier achieved high-level performance on several bioinformatics problems (e.g. tRNA identification and protein-protein interaction prediction), demonstrating efficiency and robustness.
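
The majority-voting step mentioned in this abstract can be sketched as follows. This is a plain majority vote only. It does not reproduce the divide-and-conquer variant or any other part of PTHS, and the function name `majority_vote` is an assumption made for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model predictions into one label per sample.

    predictions: list of label lists, one list per basic model,
                 all of the same length.
    Ties go to the label that appears first among the models.
    """
    if not predictions:
        raise ValueError("need predictions from at least one model")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all models must predict the same number of samples")
    combined = []
    for sample_votes in zip(*predictions):
        # most_common(1) returns the first-seen label among equal counts
        combined.append(Counter(sample_votes).most_common(1)[0][0])
    return combined
```

With three models voting on three samples, `majority_vote([["a", "b", "a"], ["a", "b", "b"], ["b", "b", "a"]])` returns `["a", "b", "a"]`.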


Subject(s)
Computational Biology/methods , Data Mining/methods , Machine Learning , Proteins/classification , RNA, Transfer/classification , Area Under Curve , Databases, Genetic , Protein Interaction Maps , Proteins/metabolism , RNA, Transfer/genetics , RNA, Transfer/metabolism , ROC Curve , Reproducibility of Results
5.
BMC Syst Biol ; 11(Suppl 6): 100, 2017 12 14.
Article in English | MEDLINE | ID: mdl-29297337

ABSTRACT

BACKGROUND: Building evolutionary trees for massive unaligned DNA sequences is challenging and crucial. However, reconstructing evolutionary trees for ultra-large sequence sets is hard. Massive multiple sequence alignment is also challenging and consumes considerable time and space. Hadoop and Spark, both developed recently, offer new opportunities for classical computational biology problems. In this paper, we tried to solve multiple sequence alignment and evolutionary reconstruction in parallel. RESULTS: HPTree, which is developed in this paper, can deal with big DNA sequence files quickly. It works well on files larger than 1 GB and performs better than other evolutionary reconstruction tools. Users can use HPTree to reconstruct evolutionary trees on computer clusters or cloud platforms (e.g. Amazon Cloud). HPTree could help with population evolution research and metagenomics analysis. CONCLUSIONS: In this paper, we employ the Hadoop and Spark platforms and design an evolutionary tree reconstruction software tool for massive unaligned DNA sequences. Clustering and multiple sequence alignment are done in parallel. The neighbour-joining model was employed for building the evolutionary tree. We released our software together with its source code at http://lab.malab.cn/soft/HPtree/ .
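
The neighbour-joining model named in this abstract repeatedly picks the pair of taxa that minimises the standard criterion Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k). The sketch below shows only that single selection step on an in-memory distance matrix. It is not HPTree's parallel implementation, and the function name `nj_pick_pair` is hypothetical.

```python
def nj_pick_pair(dist):
    """Return ((i, j), q) for the pair with the smallest NJ Q-value.

    dist: symmetric n x n distance matrix (list of lists), n >= 3,
          with zeros on the diagonal.
    """
    n = len(dist)
    if n < 3:
        raise ValueError("neighbour joining needs at least three taxa")
    totals = [sum(row) for row in dist]  # summed distance from each taxon
    best_pair, best_q = None, None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - totals[i] - totals[j]
            if best_q is None or q < best_q:
                best_pair, best_q = (i, j), q
    return best_pair, best_q
```

A full reconstruction would then join the selected pair into a new node, recompute the distances to that node, and repeat until three nodes remain.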


Subject(s)
Evolution, Molecular , Phylogeny , Sequence Analysis, DNA/methods , Software , Algorithms , Classification/methods , Computational Biology/methods , Sequence Alignment/methods
6.
BMC Syst Biol ; 10(Suppl 4): 114, 2016 Dec 23.
Article in English | MEDLINE | ID: mdl-28155714

ABSTRACT

BACKGROUND: It is essential to discover protein function from novel primary sequences. Wet lab experimental procedures are not only time-consuming but also costly, so reliably predicting protein structure and function based only on amino acid sequence has significant value. TATA-binding protein (TBP) is a kind of DNA-binding protein that plays a key role in transcription regulation. Our study proposed an automatic approach for identifying TATA-binding proteins efficiently, accurately, and conveniently. This method could guide the identification of special proteins with computational intelligence strategies. RESULTS: First, we proposed novel fingerprint features for TBP based on pseudo amino acid composition, physicochemical properties, and secondary structure. Second, hierarchical feature dimensionality reduction strategies were employed to further improve performance. Currently, Pretata achieves 92.92% TATA-binding protein prediction accuracy, which is better than all other existing methods. CONCLUSIONS: The experiments demonstrate that our method could greatly improve prediction accuracy and speed, thus making large-scale NGS data prediction practical. A web server has been developed to assist other researchers and can be accessed at http://server.malab.cn/preTata/ .
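
As a rough illustration of sequence-derived features, the sketch below computes the plain 20-dimensional amino acid composition of a protein sequence. This is a simplification: it omits the sequence-order terms of pseudo amino acid composition and is not Pretata's actual feature set. The names `AMINO_ACIDS` and `amino_acid_composition` are assumptions made for this example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def amino_acid_composition(sequence):
    """Return the fraction of each standard residue in `sequence`.

    Non-standard characters (e.g. X, gaps) are skipped and do not
    count toward the total.
    """
    counts = {aa: 0 for aa in AMINO_ACIDS}
    valid = 0
    for residue in sequence.upper():
        if residue in counts:
            counts[residue] += 1
            valid += 1
    if valid == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / valid for aa in AMINO_ACIDS]
```

The resulting vector always has length 20 and sums to 1, so sequences of different lengths map to a fixed-size input for a classifier.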


Subject(s)
Computational Biology/methods , TATA-Box Binding Protein/metabolism , Amino Acid Sequence , Chemical Phenomena , Protein Binding , Protein Structure, Secondary , Software , Support Vector Machine , TATA-Box Binding Protein/chemistry