1.
Methods Mol Biol ; 2760: 319-344, 2024.
Article in English | MEDLINE | ID: mdl-38468097

ABSTRACT

We briefly present machine learning approaches for designing better biological experiments. These approaches build on machine learning predictors and provide additional tools to guide scientific discovery. There are two different kinds of objectives when designing better experiments: to improve the predictive model or to improve the experimental outcome. We survey five different approaches for adaptive experimental design that iteratively search the space of possible experiments while adapting to measured data. The approaches are Bayesian optimization, bandits, reinforcement learning, optimal experimental design, and active learning. These machine learning approaches have shown promise in various areas of biology, and we provide broad guidelines to the practitioner and links to further resources.


Subject(s)
Machine Learning , Research Design , Bayes Theorem
3.
ACS Synth Biol ; 11(7): 2314-2326, 2022 07 15.
Article in English | MEDLINE | ID: mdl-35704784

ABSTRACT

Optimization of gene expression levels is an essential part of the organism design process. Fine control of this process can be achieved by engineering transcription and translation control elements, including the ribosome binding site (RBS). Unfortunately, the design of specific genetic parts remains challenging because of the lack of reliable design methods. To address this problem, we have created a machine learning guided Design-Build-Test-Learn (DBTL) cycle for the experimental design of bacterial RBSs to demonstrate how small genetic parts can be reliably designed using relatively small, high-quality data sets. We used Gaussian Process Regression for the Learn phase of the cycle and the Upper Confidence Bound multiarmed bandit algorithm for the Design of genetic variants to be tested in vivo. We have integrated these machine learning algorithms with laboratory automation and high-throughput processes for reliable data generation. Notably, by Testing a total of 450 RBS variants in four DBTL cycles, we have experimentally validated RBSs with high translation initiation rates equaling or exceeding our benchmark RBS by up to 34%. Overall, our results show that machine learning is a powerful tool for designing RBSs, and they pave the way toward more complicated genetic devices.


Subject(s)
Machine Learning , Ribosomes , Algorithms , Binding Sites , Ribosomes/genetics , Ribosomes/metabolism
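
Illustrative note on the Design step in the abstract above: an Upper Confidence Bound rule scores each candidate by its predicted value plus a multiple of its predictive uncertainty and tests the top scorers next. The following is a minimal sketch, not the authors' implementation. The function name `ucb_select`, the exploration weight `beta`, and the example numbers are all hypothetical; the posterior means and standard deviations are assumed to come from a fitted Gaussian Process regressor.

```python
import numpy as np

def ucb_select(mean, std, beta=2.0, batch_size=3):
    """Return indices of the candidates with the highest upper confidence bound."""
    mean = np.asarray(mean, dtype=float)
    std = np.asarray(std, dtype=float)
    # Optimistic score: favours high predictions and high uncertainty.
    ucb = mean + beta * std
    # Sort by descending score and keep the first batch_size candidates.
    order = np.argsort(-ucb, kind="stable")
    return order[:batch_size].tolist()

# Hypothetical posterior summaries for five candidate RBS sequences.
posterior_mean = [0.50, 0.80, 0.30, 0.75, 0.60]
posterior_std = [0.05, 0.02, 0.40, 0.10, 0.01]

chosen = ucb_select(posterior_mean, posterior_std, beta=2.0, batch_size=2)
greedy = ucb_select(posterior_mean, posterior_std, beta=0.0, batch_size=2)
```

With `beta=0` the rule reduces to picking the highest predicted values; larger `beta` shifts the batch towards poorly characterised candidates.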
4.
Genetics ; 215(1): 25-40, 2020 05.
Article in English | MEDLINE | ID: mdl-32193188

ABSTRACT

There is increasing interest in developing diagnostics that discriminate individual mutagenic mechanisms in a range of applications that include identifying population-specific mutagenesis and resolving distinct mutation signatures in cancer samples. Analyses for these applications assume that mutagenic mechanisms have a distinct relationship with neighboring bases that allows them to be distinguished. Direct support for this assumption is limited to a small number of simple cases, e.g., CpG hypermutability. We have evaluated whether the mechanistic origin of a point mutation can be resolved using only sequence context for a more complicated case. We contrasted single nucleotide variants originating from the multitude of mutagenic processes that normally operate in the mouse germline with those induced by the potent mutagen N-ethyl-N-nitrosourea (ENU). The considerable overlap in the mutation spectra of these two samples makes this a challenging problem. Employing a new, robust log-linear modeling method, we demonstrate that neighboring bases contain information regarding point mutation direction that differs between the ENU-induced and spontaneous mutation variant classes. A logistic regression classifier exhibited strong performance at discriminating between the different mutation classes. Concordance between the feature set of the best classifier and information content analyses suggests our results can be generalized to other mutation classification problems. We conclude that machine learning can be used to build a practical classification tool to identify the mutation mechanism for individual genetic variants. Software implementing our approach is freely available under an open-source license.


Subject(s)
Machine Learning , Point Mutation , Sequence Analysis, DNA/methods , Animals , Ethylnitrosourea/toxicity , Mice , Mutagens/toxicity , Nucleotide Motifs
5.
Nat Commun ; 11(1): 730, 2020 02 05.
Article in English | MEDLINE | ID: mdl-32024845

ABSTRACT

We present SVclone, a computational method for inferring the cancer cell fraction of structural variant (SV) breakpoints from whole-genome sequencing data. SVclone accurately determines the variant allele frequencies of both SV breakends, then simultaneously estimates the cancer cell fraction and SV copy number. We assess performance using in silico mixtures of real samples, at known proportions, created from two clonal metastases from the same patient. We find that SVclone's performance is comparable to single-nucleotide variant-based methods, despite having an order of magnitude fewer data points. As part of the Pan-Cancer Analysis of Whole Genomes (PCAWG) consortium, which aggregated whole-genome sequencing data from 2658 cancers across 38 tumour types, we use SVclone to reveal a subset of liver, ovarian and pancreatic cancers with subclonally enriched copy-number neutral rearrangements that show decreased overall survival. SVclone enables improved characterisation of SV intra-tumour heterogeneity.


Subject(s)
Computational Biology/methods , Neoplasms/genetics , Neoplasms/pathology , Algorithms , Computer Simulation , DNA Copy Number Variations , Female , Gene Frequency , Genome, Human , Humans , Liver Neoplasms/genetics , Liver Neoplasms/pathology , Male , Ovarian Neoplasms/genetics , Ovarian Neoplasms/pathology , Pancreatic Neoplasms/genetics , Pancreatic Neoplasms/pathology , Prostatic Neoplasms/genetics , Prostatic Neoplasms/pathology , Sensitivity and Specificity , Whole Genome Sequencing
6.
PeerJ Comput Sci ; 4: e157, 2018.
Article in English | MEDLINE | ID: mdl-33816810

ABSTRACT

We study the problem of combining active learning suggestions to identify informative training examples by empirically comparing methods on benchmark datasets. Many active learning heuristics for classification problems have been proposed to help us pick which instance to annotate next. But what is the optimal heuristic for a particular source of data? Motivated by the success of methods that combine predictors, we combine active learners with bandit algorithms and rank aggregation methods. We demonstrate that a combination of active learners outperforms passive learning in large benchmark datasets and removes the need to pick a particular active learner a priori. We discuss challenges to finding good rewards for bandit approaches and show that rank aggregation performs well.

7.
PLoS Comput Biol ; 13(9): e1005727, 2017 Sep.
Article in English | MEDLINE | ID: mdl-28873405

ABSTRACT

Modern genomics techniques generate overwhelming quantities of data. Extracting population genetic variation demands computationally efficient methods to determine genetic relatedness between individuals (or "samples") in an unbiased manner, preferably de novo. Rapid estimation of genetic relatedness directly from sequencing data has the potential to overcome reference genome bias, and to verify that individuals belong to the correct genetic lineage before conclusions are drawn using mislabelled or misidentified samples. We present the k-mer Weighted Inner Product (kWIP), an assembly- and alignment-free estimator of genetic similarity. kWIP combines a probabilistic data structure with a novel metric, the weighted inner product (WIP), to efficiently calculate pairwise similarity between sequencing runs from their k-mer counts. It produces a distance matrix, which can then be further analysed and visualised. Our method does not require prior knowledge of the underlying genomes, and applications include establishing sample identity and detecting mix-ups, non-obvious genomic variation, and population structure. We show that kWIP can reconstruct the true relatedness between samples from simulated populations. By re-analysing several published datasets we show that our results are consistent with marker-based analyses. kWIP is written in C++, licensed under the GNU GPL, and is available from https://github.com/kdmurray91/kwip.


Subject(s)
Genetic Variation/genetics , Genetics, Population/methods , Genomics/methods , Software , Algorithms , Chlamydomonas/genetics , Models, Genetic , Models, Statistical , Sequence Analysis, DNA
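
Illustrative note on the abstract above: a weighted inner product over k-mer counts can be turned into a pairwise distance matrix as sketched below. This is a simplified sketch under stated assumptions, not the kWIP code. It uses exact `Counter` counts instead of a probabilistic data structure, and the weighting (Shannon entropy of each k-mer's presence frequency across samples) and the normalisation are assumptions chosen for illustration. All function names are hypothetical.

```python
from collections import Counter
from math import log2, sqrt

def kmer_counts(seq, k):
    """Exact k-mer counts (a stand-in for a probabilistic counting structure)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def entropy_weights(samples):
    """Assumed weighting: Shannon entropy of each k-mer's presence frequency."""
    n = len(samples)
    weights = {}
    for kmer in set().union(*samples):
        p = sum(1 for s in samples if kmer in s) / n
        # k-mers present in every sample carry no information, so weight 0.
        weights[kmer] = 0.0 if p == 1.0 else -(p * log2(p) + (1 - p) * log2(1 - p))
    return weights

def weighted_inner_product(a, b, weights):
    """Sum of weight * count_a * count_b over the k-mers shared by a and b."""
    return sum(weights[k] * a[k] * b[k] for k in a.keys() & b.keys())

def pairwise_distances(samples, weights):
    """Normalise the kernel to unit self-similarity, then convert to distances."""
    n = len(samples)
    kern = [[weighted_inner_product(samples[i], samples[j], weights)
             for j in range(n)] for i in range(n)]
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # distance of a sample to itself stays 0
            denom = sqrt(kern[i][i] * kern[j][j])
            sim = kern[i][j] / denom if denom > 0 else 0.0
            dist[i][j] = sqrt(max(0.0, 2.0 - 2.0 * sim))
    return dist

# Toy sequences: the first two are near-identical, the third is unrelated.
samples = [kmer_counts(s, 3) for s in ("ACGTACGT", "ACGTACGA", "TTTTGGGG")]
d = pairwise_distances(samples, entropy_weights(samples))
```

In this toy input the first two sequences share most 3-mers and end up closer to each other than to the third, which shares none.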
8.
Bioinformatics ; 32(12): 1840-7, 2016 06 15.
Article in English | MEDLINE | ID: mdl-26873928

ABSTRACT

MOTIVATION: Understanding the occurrence and regulation of alternative splicing (AS) is a key task towards explaining the regulatory processes that shape the complex transcriptomes of higher eukaryotes. With the advent of high-throughput sequencing of RNA (RNA-Seq), the diversity of AS transcripts could be measured at an unprecedented depth. Although the catalog of known AS events has grown ever since, novel transcripts are commonly observed when working with less well annotated organisms, in the context of disease, or within large populations. Whereas an identification of complete transcripts is technically challenging and computationally expensive, focusing on single splicing events as a proxy for transcriptome characteristics is fruitful and sufficient for a wide range of analyses. RESULTS: We present SplAdder, an alternative splicing toolbox that takes RNA-Seq alignments and an annotation file as input to (i) augment the annotation based on RNA-Seq evidence, (ii) identify alternative splicing events present in the augmented annotation graph, (iii) quantify and confirm these events based on the RNA-Seq data and (iv) test for significant quantitative differences between samples. Throughout, our main focus is on performance, accuracy and usability. AVAILABILITY: Source code and documentation are available for download at http://github.com/ratschlab/spladder. Example data, introductory information and a small tutorial are accessible via http://bioweb.me/spladder. CONTACT: andre.kahles@ratschlab.org or gunnar.ratsch@ratschlab.org. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Alternative Splicing , Gene Expression Profiling , RNA , Sequence Analysis, RNA , Transcriptome
9.
IEEE Trans Pattern Anal Mach Intell ; 38(6): 1204-16, 2016 06.
Article in English | MEDLINE | ID: mdl-26372207

ABSTRACT

This paper presents a theoretical foundation for an SVM solver in Krein spaces. Until now, all methods have been based on matrix correction, non-convex minimization, or feature-space embedding. Here we justify and evaluate a solution that uses the original (indefinite) similarity measure, in the original Krein space. This solution is the result of a stabilization procedure. We establish the correspondence between the stabilization problem (which has to be solved) and a classical SVM based on minimization (which is easy to solve). We provide simple equations to go from one to the other (in both directions). This link between stabilization and minimization problems is the key to obtaining a solution in the original Krein space. Using KSVM, one can solve SVM with usually troublesome kernels (large negative eigenvalues or large numbers of negative eigenvalues). Our experiments show that our algorithm KSVM outperforms all previously proposed approaches to deal with indefinite matrices in SVM-like kernel methods.

10.
BMC Genomics ; 15: 1117, 2014 Dec 16.
Article in English | MEDLINE | ID: mdl-25516378

ABSTRACT

BACKGROUND: Alternative splicing is an essential mechanism for increasing transcriptome and proteome diversity in eukaryotes. Particularly in multicellular eukaryotes, this mechanism is involved in the regulation of developmental and physiological processes like growth, differentiation and signal transduction. RESULTS: Here we report the genome-wide analysis of alternative splicing in the multicellular green alga Volvox carteri. The bioinformatic analysis of 132,038 expressed sequence tags (ESTs) identified 580 alternative splicing events in a total of 426 genes. The predominant type of alternative splicing in Volvox is intron retention (46.5%) followed by alternative 5' (17.9%) and 3' (21.9%) splice sites and exon skipping (9.5%). Our analysis shows that in Volvox at least ~2.9% of the intron-containing genes are subject to alternative splicing. Considering the total number of sequenced ESTs, the Volvox genome seems to provide more favorable conditions (e.g., regarding length and GC content of introns) for the occurrence of alternative splicing than the genome of its close unicellular relative Chlamydomonas. Moreover, many randomly chosen alternatively spliced genes of Volvox do not show alternative splicing in Chlamydomonas. Since the Volvox genome contains about the same number of protein-coding genes as the Chlamydomonas genome (~14,500 protein-coding genes), we assumed that alternative splicing may play a key role in generation of genomic diversity, which is required to evolve from a simple one-cell ancestor to a multicellular organism with differentiated cell types (Mol Biol Evol 31:1402-1413, 2014). To confirm the alternative splicing events identified by bioinformatic analysis, several genes with different types of alternative splicing were selected, and the predicted splice variants were verified experimentally by RT-PCR.
CONCLUSIONS: The results show that our approach for prediction of alternative splicing events in Volvox was accurate and reliable. Moreover, quantitative real-time RT-PCR appears to be useful in Volvox for analyses of relationships between the appearance of specific alternative splicing variants and different kinds of physiological, metabolic and developmental processes as well as responses to environmental changes.


Subject(s)
Alternative Splicing , Genomics , Volvox/genetics , Chromosome Mapping , Exons/genetics , Expressed Sequence Tags/metabolism , Genome, Plant/genetics , Introns/genetics , RNA Splice Sites/genetics
11.
PeerJ ; 2: e639, 2014.
Article in English | MEDLINE | ID: mdl-25374782

ABSTRACT

We present a method to assist in interpretation of the functional impact of intergenic disease-associated SNPs that is not limited to search strategies proximal to the SNP. The method builds on two sources of external knowledge: the growing understanding of three-dimensional spatial relationships in the genome, and the substantial repository of information about relationships among genetic variants, genes, and diseases captured in the published biomedical literature. We integrate chromatin conformation capture data (HiC) with literature support to rank putative target genes of intergenic disease-associated SNPs. We demonstrate that this hybrid method outperforms a genomic distance baseline on a small test set of expression quantitative trait loci, as well as either method individually. In addition, we show the potential for this method to uncover relationships between intergenic SNPs and target genes across chromosomes. With more extensive chromatin conformation capture data becoming readily available, this method provides a way forward towards functional interpretation of SNPs in the context of the three dimensional structure of the genome in the nucleus.

12.
PLoS One ; 9(4): e93319, 2014.
Article in English | MEDLINE | ID: mdl-24787002

ABSTRACT

Given the difficulty and effort required to confirm candidate causal SNPs detected in genome-wide association studies (GWAS), there is no practical way to definitively filter false positives. Recent advances in algorithmics and statistics have enabled repeated exhaustive search for bivariate features in a practical amount of time using standard computational resources, allowing us to use cross-validation to evaluate the stability. We performed 10 trials of 2-fold cross-validation of exhaustive bivariate analysis on seven Wellcome-Trust Case-Control Consortium GWAS datasets, comparing the traditional χ² test for association, the high-performance GBOOST method and the recently proposed GSS statistic (Available at http://bioinformatics.research.nicta.com.au/software/gwis/). We use Spearman's correlation to measure the similarity between the folds of cross-validation. To compare incomplete lists of ranks we propose an extension to Spearman's correlation. The extension allows us to consider a natural threshold for feature selection where the correlation is zero. This is the first reported cross-validation study of exhaustive bivariate GWAS feature selection. We found that stability between ranked lists from different cross-validation folds was higher for GSS in the majority of diseases. A thorough analysis of the correlation between SNP-frequency and univariate χ² score demonstrated that the χ² test for association is highly confounded by main effects: SNPs with high univariate significance replicably dominate the ranked results. We show that removal of the univariately significant SNPs improves χ² replicability but risks filtering pairs involving SNPs with univariate effects. We empirically confirm that the stability of GSS and GBOOST was not affected by removal of univariately significant SNPs.
These results suggest that the GSS and GBOOST tests are successfully targeting bivariate association with phenotype and that GSS is able to reliably detect a larger set of SNP-pairs than GBOOST in the majority of the data we analysed. However, the χ² test for association was confounded by main effects.


Subject(s)
Biomarkers/analysis , Genome-Wide Association Study , Case-Control Studies , Humans , Reproducibility of Results
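
Illustrative note on the abstract above: the similarity between two cross-validation folds can be measured with Spearman's rank correlation of the scores each fold assigns to the same features. The sketch below shows only the classical coefficient for complete, tie-free lists. It does not implement the extension to incomplete rank lists that the abstract proposes, since the abstract gives no details of it. Function names and example scores are hypothetical.

```python
def ranks(scores):
    """Rank scores from 1 (highest) to n (lowest); assumes there are no ties."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(scores_a, scores_b):
    """Classical Spearman correlation for two equally long, tie-free score lists."""
    n = len(scores_a)
    if n != len(scores_b) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    ra, rb = ranks(scores_a), ranks(scores_b)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Hypothetical test statistics for the same four SNP pairs in two folds.
fold_1 = [9.1, 7.4, 3.2, 1.0]
fold_2 = [8.7, 6.9, 2.5, 0.4]   # same ordering as fold_1
fold_3 = [0.3, 2.2, 5.0, 9.9]   # reversed ordering

same = spearman(fold_1, fold_2)
opposite = spearman(fold_1, fold_3)
```

A value near 1 indicates that both folds rank the features in nearly the same order; a value near 0 indicates no agreement.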
13.
Bioinformatics ; 29(20): 2625-32, 2013 Oct 15.
Article in English | MEDLINE | ID: mdl-23900189

ABSTRACT

MOTIVATION: Biological systems are understood through iterations of modeling and experimentation. Not all experiments, however, are equally valuable for predictive modeling. This study introduces an efficient method for experimental design aimed at selecting dynamical models from data. Motivated by biological applications, the method enables the design of crucial experiments: it determines a highly informative selection of measurement readouts and time points. RESULTS: We demonstrate formal guarantees of design efficiency on the basis of previous results. By reducing our task to the setting of graphical models, we prove that the method finds a near-optimal design selection with a polynomial number of evaluations. Moreover, the method exhibits the best polynomial-complexity constant approximation factor, unless P = NP. We measure the performance of the method in comparison with established alternatives, such as ensemble non-centrality, on example models of different complexity. Efficient design accelerates the loop between modeling and experimentation: it enables the inference of complex mechanisms, such as those controlling central metabolic operation. AVAILABILITY: Toolbox 'NearOED' available with source code under GPL on the Machine Learning Open Source Software Web site (mloss.org).


Subject(s)
Research Design , Systems Biology/methods , Animals , Models, Theoretical , Probability , Signal Transduction , Software , TOR Serine-Threonine Kinases/metabolism
14.
BMC Genomics ; 14 Suppl 3: S10, 2013.
Article in English | MEDLINE | ID: mdl-23819779

ABSTRACT

BACKGROUND: It has been hypothesized that multivariate analysis and systematic detection of epistatic interactions between explanatory genotyping variables may help resolve the problem of "missing heritability" currently observed in genome-wide association studies (GWAS). However, even the simplest bivariate analysis is still held back by significant statistical and computational challenges that are often addressed by reducing the set of analysed markers. Theoretically, it has been shown that combinations of loci may exist that show weak or no effects individually, but show significant (even complete) explanatory power over phenotype when combined. Reducing the set of analysed SNPs before bivariate analysis could easily omit such critical loci. RESULTS: We have developed an exhaustive bivariate GWAS analysis methodology that yields a manageable subset of candidate marker pairs for subsequent analysis using other, often more computationally expensive techniques. Our model-free filtering approach is based on classification using ROC curve analysis, an alternative to much slower regression-based modelling techniques. Exhaustive analysis of studies containing approximately 450,000 SNPs and 5,000 samples requires only 2 hours using a desktop CPU or 13 minutes using a GPU (Graphics Processing Unit). We validate our methodology with analysis of simulated datasets as well as the seven Wellcome Trust Case-Control Consortium datasets that represent a wide range of real life GWAS challenges. We have identified SNP pairs that have considerably stronger association with disease than their individual component SNPs that often show negligible effect univariately. When compared against previously reported results in the literature, our methods re-detect most significant SNP-pairs and additionally detect many pairs absent from the literature that show strong association with disease. The high overlap suggests that our fast analysis could substitute for some slower alternatives. 
CONCLUSIONS: We demonstrate that the proposed methodology is robust, fast and capable of exhaustive search for epistatic interactions using a standard desktop computer. First, our implementation is significantly faster than timings for comparable algorithms reported in the literature, especially as our method allows simultaneous use of multiple statistical filters with low computing time overhead. Second, for some diseases, we have identified hundreds of SNP pairs that pass formal multiple test (Bonferroni) correction and could form a rich source of hypotheses for follow-up analysis. AVAILABILITY: A web-based version of the software used for this analysis is available at http://bioinformatics.research.nicta.com.au/gwis.


Subject(s)
Algorithms , Computational Biology/methods , Epistasis, Genetic/genetics , Genome-Wide Association Study/methods , Models, Genetic , Polymorphism, Single Nucleotide/genetics , Software , Computer Simulation , Humans , ROC Curve , Sensitivity and Specificity , Time Factors
15.
J Pathol Inform ; 4(Suppl): S2, 2013.
Article in English | MEDLINE | ID: mdl-23766938

ABSTRACT

BACKGROUND: Histological tissue analysis often involves manual cell counting and staining estimation of cancerous cells. These assessments are extremely time consuming, highly subjective and prone to error, since immunohistochemically stained cancer tissues usually show high variability in cell sizes, morphological structures and staining quality. To facilitate reproducible analysis in clinical practice as well as for cancer research, objective computer assisted staining estimation is highly desirable. METHODS: We employ machine learning algorithms such as randomized decision trees and support vector machines for nucleus detection and classification. Superpixels, used as a segmentation of the tissue image, are classified into foreground and background and thereafter into malignant and benign, learning from the user's feedback. As a fast alternative without nucleus classification, the existing color deconvolution method is incorporated. RESULTS: Our program TMARKER connects already available workflows for computational pathology and immunohistochemical tissue rating with modern active learning algorithms from machine learning and computer vision. On a test dataset of human renal clear cell carcinoma and prostate carcinoma, the performance of the algorithms used is equivalent to that of two independent pathologists for nucleus detection and classification. CONCLUSION: We present a novel, free and operating system independent software package for computational cell counting and staining estimation, supporting IHC stained tissue analysis in the clinic and for research. Proprietary toolboxes for similar tasks are expensive, bound to specific commercial hardware (e.g. a microscope) and mostly not quantitatively validated in terms of performance and reproducibility. We are confident that the presented software package will prove valuable for the scientific community and we anticipate a broader application domain due to the possibility to interactively learn models for new image types.

16.
PLoS Comput Biol ; 7(6): e1002079, 2011 Jun.
Article in English | MEDLINE | ID: mdl-21731479

ABSTRACT

Decoding models, such as those underlying multivariate classification algorithms, have been increasingly used to infer cognitive or clinical brain states from measures of brain activity obtained by functional magnetic resonance imaging (fMRI). The practicality of current classifiers, however, is restricted by two major challenges. First, due to the high data dimensionality and low sample size, algorithms struggle to separate informative from uninformative features, resulting in poor generalization performance. Second, popular discriminative methods such as support vector machines (SVMs) rarely afford mechanistic interpretability. In this paper, we address these issues by proposing a novel generative-embedding approach that incorporates neurobiologically interpretable generative models into discriminative classifiers. Our approach extends previous work on trial-by-trial classification for electrophysiological recordings to subject-by-subject classification for fMRI and offers two key advantages over conventional methods: it may provide more accurate predictions by exploiting discriminative information encoded in 'hidden' physiological quantities such as synaptic connection strengths; and it affords mechanistic interpretability of clinical classifications. Here, we introduce generative embedding for fMRI using a combination of dynamic causal models (DCMs) and SVMs. We propose a general procedure of DCM-based generative embedding for subject-wise classification, provide a concrete implementation, and suggest good-practice guidelines for unbiased application of generative embedding in the context of fMRI. We illustrate the utility of our approach by a clinical example in which we classify moderately aphasic patients and healthy controls using a DCM of thalamo-temporal regions during speech processing. 
Generative embedding achieves a near-perfect balanced classification accuracy of 98% and significantly outperforms conventional activation-based and correlation-based methods. This example demonstrates how disease states can be detected with very high accuracy and, at the same time, be interpreted mechanistically in terms of abnormalities in connectivity. We envisage that future applications of generative embedding may provide crucial advances in dissecting spectrum disorders into physiologically more well-defined subgroups.


Subject(s)
Algorithms , Aphasia/physiopathology , Brain/physiopathology , Computational Biology/methods , Magnetic Resonance Imaging , Adult , Aged , Bayes Theorem , Brain/pathology , Databases, Factual , Humans , Male , Middle Aged , Models, Neurological , Nervous System Diseases/diagnosis , Nervous System Diseases/physiopathology , Pattern Recognition, Automated , Principal Component Analysis , Reproducibility of Results , Speech Perception
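
Illustrative note on the abstract above: balanced accuracy is the mean of sensitivity and specificity, which keeps the figure meaningful when patient and control groups differ in size. The sketch below is a generic calculation, not the authors' code. The function name and the example labels are hypothetical.

```python
def balanced_accuracy(y_true, y_pred, positive=1):
    """Mean of sensitivity and specificity for a binary classification."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)   # fraction of positives detected
    specificity = tn / (tn + fp)   # fraction of negatives correctly rejected
    return (sensitivity + specificity) / 2

# Hypothetical labels: 1 = patient, 0 = control.
labels = [1, 1, 1, 1, 0, 0]
predictions = [1, 1, 1, 0, 0, 0]
score = balanced_accuracy(labels, predictions)
```

In this toy example the plain accuracy is 5/6, while the balanced figure weights the two groups equally regardless of their sizes.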
17.
Neuroimage ; 56(2): 601-15, 2011 May 15.
Article in English | MEDLINE | ID: mdl-20406688

ABSTRACT

Conventional decoding methods in neuroscience aim to predict discrete brain states from multivariate correlates of neural activity. This approach faces two important challenges. First, a small number of examples are typically represented by a much larger number of features, making it hard to select the few informative features that allow for accurate predictions. Second, accuracy estimates and information maps often remain descriptive and can be hard to interpret. In this paper, we propose a model-based decoding approach that addresses both challenges from a new angle. Our method involves (i) inverting a dynamic causal model of neurophysiological data in a trial-by-trial fashion; (ii) training and testing a discriminative classifier on a strongly reduced feature space derived from trial-wise estimates of the model parameters; and (iii) reconstructing the separating hyperplane. Since the approach is model-based, it provides a principled dimensionality reduction of the feature space; in addition, if the model is neurobiologically plausible, decoding results may offer a mechanistically meaningful interpretation. The proposed method can be used in conjunction with a variety of modelling approaches and brain data, and supports decoding of either trial or subject labels. Moreover, it can supplement evidence-based approaches for model-based decoding and enable structural model selection in cases where Bayesian model selection cannot be applied. Here, we illustrate its application using dynamic causal modelling (DCM) of electrophysiological recordings in rodents. We demonstrate that the approach achieves significant above-chance performance and, at the same time, allows for a neurobiological interpretation of the results.


Subject(s)
Brain/physiology , Computer Simulation , Models, Neurological , Animals , Rats
18.
Genome Res ; 19(11): 2133-43, 2009 Nov.
Article in English | MEDLINE | ID: mdl-19564452

ABSTRACT

We present a highly accurate gene-prediction system for eukaryotic genomes, called mGene. It combines in an unprecedented manner the flexibility of generalized hidden Markov models (gHMMs) with the predictive power of modern machine learning methods, such as Support Vector Machines (SVMs). Its excellent performance was proved in an objective competition based on the genome of the nematode Caenorhabditis elegans. Considering the average of sensitivity and specificity, the developmental version of mGene exhibited the best prediction performance on nucleotide, exon, and transcript level for ab initio and multiple-genome gene-prediction tasks. The fully developed version shows superior performance in 10 out of 12 evaluation criteria compared with the other participating gene finders, including Fgenesh++ and Augustus. An in-depth analysis of mGene's genome-wide predictions revealed that approximately 2200 predicted genes were not contained in the current genome annotation. Testing a subset of 57 of these genes by RT-PCR and sequencing, we confirmed expression for 24 (42%) of them. mGene missed 300 annotated genes, out of which 205 were unconfirmed. RT-PCR testing of 24 of these genes resulted in a success rate of merely 8%. These findings suggest that even the gene catalog of a well-studied organism such as C. elegans can be substantially improved by mGene's predictions. We also provide gene predictions for the four nematodes C. briggsae, C. brenneri, C. japonica, and C. remanei. Comparing the resulting proteomes among these organisms and to the known protein universe, we identified many species-specific gene inventions. In a quality assessment of several available annotations for these genomes, we find that mGene's predictions are most accurate.


Subject(s)
Algorithms , Caenorhabditis elegans/genetics , Computational Biology/methods , Genome, Helminth/genetics , Animals , Artificial Intelligence , Caenorhabditis/classification , Caenorhabditis/genetics , Genes, Helminth/genetics , Genomics/methods , RNA Splice Sites , Reproducibility of Results , Reverse Transcriptase Polymerase Chain Reaction , Sequence Analysis, DNA , Transcription Initiation Site
19.
Nucleic Acids Res ; 37(Web Server issue): W312-6, 2009 Jul.
Article in English | MEDLINE | ID: mdl-19494180

ABSTRACT

We describe mGene.web, a web service for the genome-wide prediction of protein coding genes from eukaryotic DNA sequences. It offers pre-trained models for the recognition of gene structures including untranslated regions in an increasing number of organisms. With mGene.web, users have the additional possibility to train the system with their own data for other organisms at the push of a button, a functionality that will greatly accelerate the annotation of newly sequenced genomes. The system is built in a highly modular way, such that individual components of the framework, like the promoter prediction tool or the splice site predictor, can be used autonomously. The underlying gene finding system mGene is based on discriminative machine learning techniques and its high accuracy has been demonstrated in an international competition on nematode genomes. mGene.web is available at http://www.mgene.org/web; it is free of charge and can be used for eukaryotic genomes of small to moderate size (several hundred Mbp).


Subject(s)
Genes , Genomics , Proteins/genetics , Software , Internet , RNA Splice Sites , Sequence Analysis, DNA , Transcription Initiation Site