1.
Genome Res ; 19(11): 2133-43, 2009 Nov.
Article in English | MEDLINE | ID: mdl-19564452

ABSTRACT

We present a highly accurate gene-prediction system for eukaryotic genomes, called mGene. It combines in an unprecedented manner the flexibility of generalized hidden Markov models (gHMMs) with the predictive power of modern machine learning methods, such as Support Vector Machines (SVMs). Its excellent performance was proved in an objective competition based on the genome of the nematode Caenorhabditis elegans. Considering the average of sensitivity and specificity, the developmental version of mGene exhibited the best prediction performance at the nucleotide, exon, and transcript levels for ab initio and multiple-genome gene-prediction tasks. The fully developed version shows superior performance in 10 out of 12 evaluation criteria compared with the other participating gene finders, including Fgenesh++ and Augustus. An in-depth analysis of mGene's genome-wide predictions revealed that approximately 2200 predicted genes were not contained in the current genome annotation. Testing a subset of 57 of these genes by RT-PCR and sequencing, we confirmed expression for 24 (42%) of them. mGene missed 300 annotated genes, out of which 205 were unconfirmed. RT-PCR testing of 24 of these genes resulted in a success rate of merely 8%. These findings suggest that even the gene catalog of a well-studied organism such as C. elegans can be substantially improved by mGene's predictions. We also provide gene predictions for the four nematodes C. briggsae, C. brenneri, C. japonica, and C. remanei. Comparing the resulting proteomes among these organisms and to the known protein universe, we identified many species-specific gene inventions. In a quality assessment of several available annotations for these genomes, we find that mGene's predictions are most accurate.
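
The summary metric named in the abstract, the average of sensitivity and specificity, can be sketched as follows. This is a minimal illustration, not the competition's evaluation code. It assumes the convention common in gene-finding evaluations, where "specificity" is the fraction of predictions that are correct (TP / (TP + FP)); the abstract does not state the formula. The function names and the example counts are hypothetical. The last lines only re-derive the 42% confirmation rate from the counts given in the abstract.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of annotated features that were predicted: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tp: int, fp: int) -> float:
    """Fraction of predicted features that are correct: TP / (TP + FP).

    Assumption: this is the gene-finding convention (elsewhere called
    precision), not TN / (TN + FP).
    """
    return tp / (tp + fp)


def mean_sn_sp(tp: int, fp: int, fn: int) -> float:
    """Average of sensitivity and specificity for one evaluation level."""
    return 0.5 * (sensitivity(tp, fn) + specificity(tp, fp))


# Hypothetical counts, for illustration only.
example = mean_sn_sp(tp=8, fp=2, fn=2)

# Counts taken from the abstract: 24 of 57 tested novel genes were confirmed.
confirmed, tested = 24, 57
confirmation_rate = confirmed / tested
print(f"{confirmation_rate:.0%}")
```

The same function would be applied separately to nucleotide-, exon-, and transcript-level counts, since each level defines true and false positives differently.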


Subject(s)
Algorithms , Caenorhabditis elegans/genetics , Computational Biology/methods , Genome, Helminth/genetics , Animals , Artificial Intelligence , Caenorhabditis/classification , Caenorhabditis/genetics , Genes, Helminth/genetics , Genomics/methods , RNA Splice Sites , Reproducibility of Results , Reverse Transcriptase Polymerase Chain Reaction , Sequence Analysis, DNA , Transcription Initiation Site
2.
Bioinformatics ; 24(13): i6-14, 2008 Jul 01.
Article in English | MEDLINE | ID: mdl-18586746

ABSTRACT

MOTIVATION: At the heart of many important bioinformatics problems, such as gene finding and function prediction, is the classification of biological sequences. Frequently the most accurate classifiers are obtained by training support vector machines (SVMs) with complex sequence kernels. However, a cumbersome shortcoming of SVMs is that their learned decision rules are very hard for humans to understand and cannot easily be related to biological facts. RESULTS: To make SVM-based sequence classifiers more accessible and profitable, we introduce the concept of positional oligomer importance matrices (POIMs) and propose an efficient algorithm for their computation. In contrast to the raw SVM feature weighting, POIMs take the underlying correlation structure of k-mer features induced by overlaps of related k-mers into account. POIMs can be seen as a powerful generalization of sequence logos: they make it possible to capture and visualize sequence patterns that are relevant for the investigated biological phenomena. AVAILABILITY: All source code, datasets, tables and figures are available at http://www.fml.tuebingen.mpg.de/raetsch/projects/POIM. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
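
To illustrate why overlaps between k-mers matter, the sketch below computes a POIM-style importance by brute force. It is not the efficient algorithm the abstract refers to. It assumes the importance of a k-mer z at position j is the expected classifier score given that z occurs at j, minus the overall expected score, under a uniform distribution over sequences; the abstract itself gives no formula. The scoring function `toy_score` and its weights are hypothetical stand-ins for a trained SVM.

```python
from itertools import product

ALPHABET = "ACGT"


def toy_score(seq: str) -> float:
    """Hypothetical stand-in for an SVM output: a sum of positional k-mer weights."""
    weights = {("GT", 1): 2.0, ("AG", 0): 1.0}  # (k-mer, position) -> weight
    return sum(
        w for (kmer, pos), w in weights.items()
        if seq[pos:pos + len(kmer)] == kmer
    )


def poim_entry(score, kmer: str, pos: int, length: int) -> float:
    """Brute-force importance: E[score | kmer at pos] - E[score].

    Enumerates all 4**length sequences, so it is only usable for tiny examples.
    """
    seqs = ["".join(s) for s in product(ALPHABET, repeat=length)]
    overall = sum(score(s) for s in seqs) / len(seqs)
    matching = [s for s in seqs if s[pos:pos + len(kmer)] == kmer]
    conditional = sum(score(s) for s in matching) / len(matching)
    return conditional - overall


# "GT" at position 1 has raw weight 2.0, but its importance also reflects the
# overlapping "AG" feature at position 0, which shares the G at position 1.
gt_importance = poim_entry(toy_score, "GT", 1, length=4)

# "A" at position 0 has no weight of its own, yet it receives positive
# importance because it makes the weighted k-mer "AG" at position 0 possible.
a_importance = poim_entry(toy_score, "A", 0, length=4)
```

In this toy setting the single nucleotide "A" has no raw weight at all, so a raw weight table would show nothing for it, while the conditional expectation still credits it through the overlapping "AG" feature.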


Subject(s)
Algorithms , Artificial Intelligence , DNA/genetics , Pattern Recognition, Automated/methods , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Base Sequence , Molecular Sequence Data
3.
BMC Bioinformatics ; 8 Suppl 10: S7, 2007.
Article in English | MEDLINE | ID: mdl-18269701

ABSTRACT

BACKGROUND: For splice site recognition, one has to solve two classification problems: discriminating true from decoy splice sites for both acceptor and donor sites. Gene finding systems typically rely on Markov Chains to solve these tasks. RESULTS: In this work we consider Support Vector Machines for splice site recognition. We employ the so-called weighted degree kernel, which turns out to be well suited for this task, as we will illustrate in several experiments where we compare its prediction accuracy with that of recently proposed systems. We apply our method to the genome-wide recognition of splice sites in Caenorhabditis elegans, Drosophila melanogaster, Arabidopsis thaliana, Danio rerio, and Homo sapiens. Our performance estimates indicate that splice sites can be recognized very accurately in these genomes and that our method outperforms many other methods including Markov Chains, GeneSplicer and SpliceMachine. We provide genome-wide predictions of splice sites and a stand-alone prediction tool ready to be used for incorporation in a gene finder. AVAILABILITY: Data, splits, additional information on the model selection, the whole genome predictions, as well as the stand-alone prediction tool are available for download at http://www.fml.mpg.de/raetsch/projects/splice.
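
The weighted degree kernel mentioned above can be sketched as below. This is a minimal illustration, not the authors' stand-alone tool. It assumes the form commonly given in the literature: for two equal-length sequences, count the positions at which the substrings of each length d = 1..D agree, and weight each count by beta_d = 2(D - d + 1) / (D(D + 1)). The abstract states neither the formula nor the degree used; the function name `wd_kernel` and the default `degree=3` are illustrative choices.

```python
def wd_kernel(x: str, y: str, degree: int = 3) -> float:
    """Weighted degree kernel between two equal-length sequences.

    k(x, y) = sum_{d=1..D} beta_d * #{i : x[i:i+d] == y[i:i+d]}
    with beta_d = 2 * (D - d + 1) / (D * (D + 1)), so shorter matches
    receive larger weights and the weights sum to 1.
    """
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    length = len(x)
    total = 0.0
    for d in range(1, degree + 1):
        beta = 2.0 * (degree - d + 1) / (degree * (degree + 1))
        # Count positions where the length-d substrings are identical.
        matches = sum(
            1 for i in range(length - d + 1) if x[i:i + d] == y[i:i + d]
        )
        total += beta * matches
    return total


# Two short windows differing at one position (toy data, not real splice sites).
k_same = wd_kernel("ACGT", "ACGT", degree=2)
k_diff = wd_kernel("ACGT", "ACTT", degree=2)
```

Because matches are compared position by position, the kernel is sensitive to where a motif occurs relative to the candidate site, which fits signals such as splice sites that sit at a fixed offset in the input window. A function like this could be passed as a custom kernel to an SVM library that accepts precomputed kernel matrices.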


Subject(s)
RNA Splice Sites/genetics , Algorithms , Animals , Arabidopsis/genetics , Brassicaceae/genetics , Caenorhabditis elegans/genetics , Drosophila melanogaster/genetics , Forecasting/methods , Genomics/methods , Humans , Markov Chains , Zebrafish/genetics