Results 1 - 11 of 11
1.
Bioinformatics ; 37(18): 2803-2810, 2021 09 29.
Article in English | MEDLINE | ID: mdl-33822891

ABSTRACT

MOTIVATION: Metagenomic approaches hold the potential to characterize microbial communities and unravel the intricate link between the microbiome and biological processes. Assembly is one of the most critical steps in metagenomics experiments. It consists of transforming overlapping DNA sequencing reads into sufficiently accurate representations of the community's genomes. This process is computationally difficult and commonly results in genomes fragmented across many contigs. Computational binning methods are used to mitigate fragmentation by partitioning contigs based on their sequence composition, abundance or chromosome organization into bins representing the community's genomes. Existing binning methods have been principally tuned for bacterial genomes and do not perform favorably on viral metagenomes. RESULTS: We propose Composition and Coverage Network (CoCoNet), a new binning method for viral metagenomes that leverages the flexibility and the effectiveness of deep learning to model the co-occurrence of contigs belonging to the same viral genome and provide a rigorous framework for binning viral contigs. Our results show that CoCoNet substantially outperforms existing binning methods on viral datasets. AVAILABILITY AND IMPLEMENTATION: CoCoNet was implemented in Python and is available for download on PyPi (https://pypi.org/). The source code is hosted on GitHub at https://github.com/Puumanamana/CoCoNet and the documentation is available at https://coconet.readthedocs.io/en/latest/index.html. CoCoNet does not require extensive resources to run. For example, binning 100k contigs took about 4 h on 10 Intel CPU Cores (2.4 GHz), with a memory peak at 27 GB (see Supplementary Fig. S9). To process a large dataset, CoCoNet may need to be run on a high RAM capacity server. Such servers are typically available in high-performance or cloud computing settings. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
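As an illustration of what "sequence composition" can mean in contig binning, the sketch below computes a normalized k-mer frequency vector for a contig. This is a common generic representation and is only an assumption here: the abstract does not say which composition features CoCoNet uses, and the function name `kmer_frequencies` and the default `k=4` are hypothetical choices, not part of the CoCoNet package.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=4):
    """Return the normalized frequency of every A/C/G/T k-mer in `sequence`.

    Hypothetical helper for illustration only; not CoCoNet's implementation.
    k-mers containing other symbols (e.g. N) are ignored.
    """
    sequence = sequence.upper()
    # Fixed ordering of all 4**k possible k-mers, so vectors are comparable.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]
```

Two contigs from the same genome would be expected to have similar vectors, which is why composition is often combined with coverage across samples when grouping contigs.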


Subject(s)
Deep Learning , Microbiota , Metagenome , Algorithms , Software , Microbiota/genetics , Sequence Analysis, DNA/methods , Metagenomics/methods
2.
Int J Mol Sci ; 16(1): 1466-81, 2015 Jan 08.
Article in English | MEDLINE | ID: mdl-25580537

ABSTRACT

The discovery of novel microRNA (miRNA) and piwi-interacting RNA (piRNA) is an important task for the understanding of many biological processes. Most of the available miRNA and piRNA identification methods are dependent on the availability of the organism's genome sequence and the quality of its annotation. Therefore, an efficient prediction method based solely on the short RNA reads and requiring no genomic information is highly desirable. In this study, we propose an approach that relies primarily on the nucleotide composition of the read and does not require reference genomes of related species for prediction. Using an empirical Bayesian kernel method and the error correcting output codes framework, compact models suitable for large-scale analyses are built on databases of known mature miRNAs and piRNAs. We found that the usage of an L1-based Gaussian kernel can double the true positive rate compared to the standard L2-based Gaussian kernel. Our approach can increase the true positive rate by at most 60% compared to the existing piRNA predictor based on the analysis of a hold-out test set. Using experimental data, we also show that our approach can detect about an order of magnitude or more known miRNAs than the mature miRNA predictor, miRPlex.


Subject(s)
MicroRNAs/metabolism , RNA, Small Interfering/metabolism , Animals , Caenorhabditis elegans/genetics , Databases, Genetic , Drosophila melanogaster/genetics , Genome , MicroRNAs/genetics , Normal Distribution , RNA, Small Interfering/genetics , ROC Curve , Support Vector Machine
3.
mBio ; 5(3): e01210-14, 2014 Jun 17.
Article in English | MEDLINE | ID: mdl-24939887

ABSTRACT

UNLABELLED: Viruses have a profound influence on the ecology and evolution of plankton, but our understanding of the composition of the aquatic viral communities is still rudimentary. This is especially true of those viruses having RNA genomes. The limited data that have been published suggest that the RNA virioplankton is dominated by viruses with positive-sense, single-stranded (+ss) genomes that have features in common with those of eukaryote-infecting viruses in the order Picornavirales (picornavirads). In this study, we investigated the diversity of the RNA virus assemblages in tropical coastal seawater samples using targeted PCR and metagenomics. Amplification of RNA-dependent RNA polymerase (RdRp) genes from fractions of a buoyant density gradient suggested that the distribution of two major subclades of the marine picornavirads was largely congruent with the distribution of total virus-like RNA, a finding consistent with their proposed dominance. Analyses of the RdRp sequences in the library revealed the presence of many diverse phylotypes, most of which were related only distantly to those of cultivated viruses. Phylogenetic analysis suggests that there were hundreds of unique picornavirad-like phylotypes in one 35-liter sample that differed from one another by at least as much as the differences among currently recognized species. Assembly of the sequences in the metagenome resulted in the reconstruction of six essentially complete viral genomes that had features similar to viruses in the families Bacillarna-, Dicistro-, and Marnaviridae. Comparison of the tropical seawater metagenomes with those from other habitats suggests that +ssRNA viruses are generally the most common types of RNA viruses in aquatic environments, but biases in library preparation remain a possible explanation for this observation. 
IMPORTANCE: Marine plankton account for much of the photosynthesis and respiration on our planet, and they influence the cycling of carbon and the distribution of nutrients on a global scale. Despite the fundamental importance of viruses to plankton ecology and evolution, most of the viruses in the sea, and the identities of their hosts, are unknown. This report is one of very few that delves into the genetic diversity within RNA-containing viruses in the ocean. The data expand the known range of viral diversity and shed new light on the physical properties and genetic composition of RNA viruses in the ocean.


Subject(s)
RNA Viruses/classification , RNA Viruses/isolation & purification , Seawater/virology , Ecosystem , Genetic Variation , Genome, Viral , Metagenomics , Molecular Sequence Data , Phylogeny , Polymerase Chain Reaction , RNA Viruses/genetics , Tropical Climate
4.
BMC Genomics ; 14 Suppl 2: S6, 2013.
Article in English | MEDLINE | ID: mdl-23445533

ABSTRACT

BACKGROUND: Classification is the problem of assigning each input object to one of a finite number of classes. This problem has been extensively studied in machine learning and statistics, and there are numerous applications to bioinformatics as well as many other fields. Building a multiclass classifier has been a challenge, where the direct approach of altering the binary classification algorithm to accommodate more than two classes can be computationally too expensive. Hence the indirect approach of using binary decomposition has been commonly used, in which retrieving the class posterior probabilities from the set of binary posterior probabilities given by the individual binary classifiers has been a major issue. METHODS: In this work, we present an extension of a recently introduced probabilistic kernel-based learning algorithm called the Classification Relevance Units Machine (CRUM) to the multiclass setting to increase its applicability. The extension is achieved under the error correcting output codes framework. The probabilistic outputs of the binary CRUMs are preserved using a proposed linear-time decoding algorithm, an alternative to the generalized Bradley-Terry (GBT) algorithm whose application to large-scale prediction settings is prohibited by its computational complexity. The resulting classifier is called the Multiclass Relevance Units Machine (McRUM). RESULTS: The evaluation of McRUM on a variety of real small-scale benchmark datasets shows that our proposed Naïve decoding algorithm is computationally more efficient than the GBT algorithm while maintaining a similar level of predictive accuracy. Then a set of experiments on a larger scale dataset for small ncRNA classification have been conducted with Naïve McRUM and compared with the Gaussian and linear SVM. 
Although McRUM's predictive performance is slightly lower than the Gaussian SVM, the results show that the similar level of true positive rate can be achieved by sacrificing false positive rate slightly. Furthermore, McRUM is computationally more efficient than the SVM, which is an important factor for large-scale analysis. CONCLUSIONS: We have proposed McRUM, a multiclass extension of binary CRUM. McRUM with Naïve decoding algorithm is computationally efficient in run-time and its predictive performance is comparable to the well-known SVM, showing its potential in solving large-scale multiclass problems in bioinformatics and other fields of study.


Subject(s)
Algorithms , Computational Biology/methods , RNA, Untranslated/classification
5.
ISME J ; 7(3): 672-9, 2013 Mar.
Article in English | MEDLINE | ID: mdl-23151645

ABSTRACT

Viruses are abundant in the ocean and a major driving force in plankton ecology and evolution. It has been assumed that most of the viruses in seawater contain DNA and infect bacteria, but RNA-containing viruses in the ocean, which almost exclusively infect eukaryotes, have never been quantified. We compared the total mass of RNA and DNA in the viral fraction harvested from seawater and using data on the mass of nucleic acid per RNA- or DNA-containing virion, estimated the abundances of each. Our data suggest that the abundance of RNA viruses rivaled or exceeded that of DNA viruses in samples of coastal seawater. The dominant RNA viruses in the samples were marine picorna-like viruses, which have small genomes and are at or below the detection limit of common fluorescence-based counting methods. If our results are typical, this means that counts of viruses and the rate measurements that depend on them, such as viral production, are significantly underestimated by current practices. As these RNA viruses infect eukaryotes, our data imply that protists contribute more to marine viral dynamics than one might expect based on their relatively low abundance. This conclusion is a departure from the prevailing view of viruses in the ocean, but is consistent with earlier theoretical predictions.


Subject(s)
RNA Viruses/physiology , Seawater/virology , Virus Physiological Phenomena , Eukaryota/virology , Genome, Viral/genetics , RNA Viruses/genetics , Seawater/microbiology , Virion/genetics
6.
ACM SIGAPP Appl Comput Rev ; 12(4): 8-20, 2012 Dec 01.
Article in English | MEDLINE | ID: mdl-24163645

ABSTRACT

Phosphorylation is an important post-translational modification of proteins that is essential to the regulation of many cellular processes. Although most of the phosphorylation sites discovered in protein sequences have been identified experimentally, the in vivo and in vitro discovery of the sites is an expensive, time-consuming and laborious task. Therefore, the development of computational methods for prediction of protein phosphorylation sites has drawn considerable attention. In this work, we present a kernel-based probabilistic Classification Relevance Units Machine (CRUM) for in silico phosphorylation site prediction. In comparison with the popular Support Vector Machine (SVM) CRUM shows comparable predictive performance and yet provides a more parsimonious model. This is desirable since it leads to a reduction in prediction run-time, which is important in predictions on large-scale data. Furthermore, the CRUM training algorithm has lower run-time and memory complexity and has a simpler parameter selection scheme than the Relevance Vector Machine (RVM) learning algorithm. To further investigate the viability of using CRUM in phosphorylation site prediction, we construct multiple CRUM predictors using different combinations of three phosphorylation site features - BLOSUM encoding, disorder, and amino acid composition. The predictors are evaluated through cross-validation and the results show that CRUM with BLOSUM feature is among the best performing CRUM predictors in both cross-validation and benchmark experiments. A comparative study with existing prediction tools in an independent benchmark experiment suggests possible direction for further improving the predictive performance of CRUM predictors.

7.
BMC Bioinformatics ; 12 Suppl 9: S10, 2011 Oct 05.
Article in English | MEDLINE | ID: mdl-22151602

ABSTRACT

BACKGROUND: A large family of viruses that infect bacteria, called phages, is characterized by long tails used to inject DNA into their victims' cells. The tape measure protein got its name because the length of the corresponding gene is proportional to the length of the phage's tail: a fact shown by actually copying or splicing out parts of DNA in exemplar species. A natural question is whether there exist units for these tape measures, and if different tape measures have different units and lengths. Such units would allow us to retrace the evolution of tape measure proteins using their duplication/loss history. The vast number of sequenced phage genomes allows us to attack this problem with a comparative genomics approach. RESULTS: Here we describe a subset of phages whose tape measure proteins contain variable numbers of an 11-amino-acid sequence repeat, aligned with sequence similarity, structural properties, and simple arithmetic. This subset provides a unique opportunity for the combinatorial study of phage evolution, without the added uncertainties of multiple alignments, which are trivial in this case, or of protein functions, which are well established. We give a heuristic that reconstructs the duplication history of these sequences, using divergent strains to discriminate between mutations that occurred before and after speciation, or lineage divergence. The heuristic is based on an efficient algorithm that gives an exhaustive enumeration of all possible parsimonious reconstructions of the duplication/speciation history of a single nucleotide. Finally, we present a method that allows us, when possible, to discriminate between duplication and loss events. CONCLUSIONS: Establishing the evolutionary history of viruses is difficult, in part due to extensive recombination and gene transfer, and high mutation rates that often erase detectable similarity between homologous genes. In this paper, we introduce new tools to address this problem.


Subject(s)
Bacteriophages/genetics , Evolution, Molecular , Genomics/methods , Repetitive Sequences, Amino Acid , Viral Proteins/chemistry , Viral Proteins/genetics , Algorithms , Gene Duplication , Sequence Deletion
8.
J Comput Biol ; 17(9): 1315-26, 2010 Sep.
Article in English | MEDLINE | ID: mdl-20874413

ABSTRACT

Comparing the genomes of two closely related viruses often produces mosaics where nearly identical sequences alternate with sequences that are unique to each genome. When several closely related genomes are compared, the unique sequences are likely to be shared with third genomes, leading to virus mosaic communities. Here we present comparative analysis of sets of Staphylococcus aureus phages that share large identical sequences with up to three other genomes, and with different partners along their genomes. We introduce mosaic graphs to represent these complex recombination events, and use them to illustrate the breadth and depth of sequence sharing: some genomes are almost completely made up of shared sequences, while genomes that share very large identical sequences can adopt alternate functional modules. Mosaic graphs also allow us to identify breakpoints that could eventually be used for the construction of recombination networks. These findings have several implications on phage metagenomics assembly, on the horizontal gene transfer paradigm, and more generally on the understanding of the composition and evolutionary dynamics of virus communities.


Subject(s)
Bacteriophages/genetics , Genome, Viral , Genomics/methods , Bacteria/virology , Base Sequence , Evolution, Molecular , Molecular Sequence Data , Multigene Family , Recombination, Genetic , Sequence Alignment
9.
Genomics Proteomics Bioinformatics ; 5(2): 121-30, 2007 May.
Article in English | MEDLINE | ID: mdl-17893077

ABSTRACT

A glycosylphosphatidylinositol (GPI) anchor is a common but complex C-terminal post-translational modification of extracellular proteins in eukaryotes. Here we investigate the problem of correctly annotating GPI-anchored proteins for the growing number of sequences in public databases. We developed a computational system, called FragAnchor, based on the tandem use of a neural network (NN) and a hidden Markov model (HMM). Firstly, NN selects potential GPI-anchored proteins in a dataset, then HMM parses these potential GPI signals and refines the prediction by qualitative scoring. FragAnchor correctly predicted 91% of all the GPI-anchored proteins annotated in the Swiss-Prot database. In a large-scale analysis of 29 eukaryote proteomes, FragAnchor predicted that the percentage of highly probable GPI-anchored proteins is between 0.21% and 2.01%. The distinctive feature of FragAnchor, compared with other systems, is that it targets only the C-terminus of a protein, making it less sensitive to the background noise found in databases and possible incomplete protein sequences. Moreover, FragAnchor can be used to predict GPI-anchored proteins in all eukaryotes. Finally, by using qualitative scoring, the predictions combine both sensitivity and information content. The predictor is publicly available at [see text].


Subject(s)
Computational Biology/methods , Eukaryotic Cells/chemistry , Glycosylphosphatidylinositols/chemistry , Glycosylphosphatidylinositols/metabolism , Sequence Analysis, Protein , Amino Acid Sequence , Databases, Protein , Glycosylphosphatidylinositols/isolation & purification , Humans , Hydrophobic and Hydrophilic Interactions , Markov Chains , Models, Genetic , Molecular Sequence Data , Neural Networks, Computer , Predictive Value of Tests , Protein Processing, Post-Translational , Proteome/analysis , Sensitivity and Specificity
10.
PLoS One ; 2(9): e830, 2007 Sep 05.
Article in English | MEDLINE | ID: mdl-17786202

ABSTRACT

BACKGROUND: In environmental sequencing projects, a mix of DNA from a whole microbial community is fragmented and sequenced, with one of the possible goals being to reconstruct partial or complete genomes of members of the community. In communities with high diversity of species, a significant proportion of the sequences do not overlap any other fragment in the sample. This problem will arise not only in situations with a relatively even distribution of many species, but also when the community in a particular environment is routinely dominated by the same few species. In the former case, no genomes may be assembled at all, while in the latter case a few dominant species in an environment will always be sequenced at high coverage to the detriment of coverage of the greater number of sparse species. METHODS AND RESULTS: Here we show that, with the same global sequencing effort, separating the species into two or more sub-communities prior to sequencing can yield a much higher proportion of sequences that can be assembled. We first use the Lander-Waterman model to show that, if the expected percentage of singleton sequences is higher than 25%, then, under the uniform distribution hypothesis, splitting the community is always a wise choice. We then construct simulated microbial communities to show that the results hold for highly non-uniform distributions. We also show that, for the distributions considered in the experiments, it is possible to estimate quite accurately the relative diversity of the two sub-communities. CONCLUSION: Given the fact that several methods exist to split microbial communities based on physical properties such as size, density, surface biochemistry, or optical properties, we strongly suggest that groups involved in environmental sequencing, and expecting high diversity, consider splitting their communities in order to maximize the information content of their sequencing effort.
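For readers unfamiliar with the Lander-Waterman model mentioned above, the following is a minimal sketch of its standard expression for the expected fraction of singleton sequences, exp(-2cσ), where c = NL/G is the coverage and σ = 1 - T/L accounts for the minimum detectable overlap T. The function and parameter names are hypothetical, and the code does not reproduce the paper's analysis of community splitting; it only shows the quantity the 25% threshold refers to.

```python
import math


def singleton_fraction(n_reads, read_len, genome_len, min_overlap=0):
    """Expected fraction of reads overlapping no other read (Lander-Waterman).

    Illustrative helper, not code from the paper. Assumes reads are placed
    uniformly at random on a single genome of length `genome_len`.
    """
    coverage = n_reads * read_len / genome_len   # c = N * L / G
    sigma = 1.0 - min_overlap / read_len         # sigma = 1 - T / L
    return math.exp(-2.0 * coverage * sigma)


def coverage_for_singleton_fraction(fraction, read_len, min_overlap=0):
    """Coverage c at which the expected singleton fraction equals `fraction`."""
    sigma = 1.0 - min_overlap / read_len
    return -math.log(fraction) / (2.0 * sigma)
```

Under these assumptions and with no minimum overlap, a 25% singleton fraction corresponds to a coverage of ln 2, or about 0.69, so the threshold in the abstract describes samples sequenced at well below one-fold coverage.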


Subject(s)
DNA, Bacterial/genetics , DNA, Fungal/genetics , Biodiversity , Environmental Monitoring , Models, Theoretical , Sequence Analysis, DNA
11.
Genome ; 48(5): 913-23, 2005 Oct.
Article in English | MEDLINE | ID: mdl-16391697

ABSTRACT

Freezing tolerance in plants is a complex trait that occurs in many plant species during growth at low, nonfreezing temperatures, a process known as cold acclimation. This process is regulated by a multigenic system expressing broad variation in the degree of freezing tolerance among wheat cultivars. Microarray analysis is a powerful and rapid approach to gene discovery. In species such as wheat, for which large scale mutant screening and transgenic studies are not currently practical, genotype comparison by this methodology represents an essential approach to identifying key genes in the acquisition of freezing tolerance. A microarray was constructed with PCR amplified cDNA inserts from 1184 wheat expressed sequence tags (ESTs) that represent 947 genes. Gene expression during cold acclimation was compared in 2 cultivars with marked differences in freezing tolerance. Transcript levels of more than 300 genes were altered by cold. Among these, 65 genes were regulated differently between the 2 cultivars for at least 1 time point. These include genes that encode potential regulatory proteins and proteins that act in plant metabolism, including protein kinases, putative transcription factors, Ca2+ binding proteins, a Golgi localized protein, an inorganic pyrophosphatase, a cell wall associated hydrolase, and proteins involved in photosynthesis.


Subject(s)
Acclimatization/genetics , Cold Temperature , Gene Expression Regulation, Plant , Triticum/genetics , Calcium-Binding Proteins/genetics , Calcium-Binding Proteins/metabolism , Carbohydrate Metabolism/genetics , Expressed Sequence Tags , Gene Expression Profiling , Genes, Plant , Golgi Apparatus/metabolism , Oligonucleotide Array Sequence Analysis , Oxidative Stress/genetics , Photosynthesis , Plant Proteins/genetics , Plant Proteins/metabolism , Seasons , Signal Transduction , Transcription Factors/metabolism , Transcription, Genetic