1.
BMC Bioinformatics ; 10: 355, 2009 Oct 27.
Article in English | MEDLINE | ID: mdl-19860884

ABSTRACT

BACKGROUND: Previous methods of detecting the taxonomic origins of arbitrary sequence collections, with a significant impact on genome analysis and in particular metagenomics, have primarily focused on compositional features of genomes. The evolutionary patterns of phylogenetic distribution of genes or proteins, represented by phylogenetic profiles, provide an alternative approach for the detection of taxonomic origins, but typically suffer from low accuracy. Herein, we present rank-BLAST, a novel approach for the assignment of protein sequences into genomic groups of the same taxonomic origin, based on the ranking order of phylogenetic profiles of target genes or proteins across the reference database. RESULTS: The rank-BLAST approach is validated by computing the phylogenetic profiles of all sequences for five distinct microbial species of varying degrees of phylogenetic proximity, against a reference database of 243 fully sequenced genomes. The approach - a combination of sequence searches, statistical estimation and clustering - analyses the degree of sequence divergence between sets of protein sequences and allows the classification of protein sequences according to the species of origin with high accuracy, allowing taxonomic classification of 64% of the proteins studied. In most cases, a main cluster is detected, representing the corresponding species. Secondary, functionally distinct and species-specific clusters exhibit different patterns of phylogenetic distribution, thus flagging gene groups of interest. Detailed analyses of such cases are provided as examples. CONCLUSION: Our results indicate that the rank-BLAST approach can capture the taxonomic origins of sequence collections in an accurate and efficient manner. The approach can be useful both for the analysis of genome evolution and the detection of species groups in metagenomics samples.


Subject(s)
Computational Biology/methods , Genome , Genomics/methods , Phylogeny , Evolution, Molecular , Metagenomics
2.
BMC Evol Biol ; 9: 28, 2009 Feb 03.
Article in English | MEDLINE | ID: mdl-19192293

ABSTRACT

BACKGROUND: The question of how genomic processes, such as gene duplication, give rise to co-ordinated organismal properties, such as emergence of new body plans, organs and lifestyles, is of importance in developmental and evolutionary biology. Herein, we focus on the diversification of the transforming growth factor-beta (TGF-beta) pathway -- one of the fundamental and versatile metazoan signal transduction engines. RESULTS: After an investigation of 33 genomes, we show that the emergence of the TGF-beta pathway coincided with appearance of the first known animal species. The primordial pathway repertoire consisted of four Smads and four receptors, similar to those observed in the extant genome of the early diverging tablet animal (Trichoplax adhaerens). We subsequently retrace duplications in ancestral genomes on the lineage leading to humans, as well as lineage-specific duplications, such as those which gave rise to novel Smads and receptors in teleost fishes. We conclude that the diversification of the TGF-beta pathway can be parsimoniously explained according to the 2R model, with additional rounds of duplications in teleost fishes. Finally, we investigate duplications followed by accelerated evolution which gave rise to an atypical TGF-beta pathway in free-living bacterial feeding nematodes of the genus Rhabditis. CONCLUSION: Our results challenge the view of well-conserved developmental pathways. The TGF-beta signal transduction engine has expanded through gene duplication, continually adopting new functions, as animals grew in anatomical complexity, colonized new environments, and developed an active immune system.


Subject(s)
Evolution, Molecular , Multigene Family , Transforming Growth Factor beta/genetics , Animals , Bayes Theorem , Gene Duplication , Genome , Humans , Likelihood Functions , Phylogeny , Sequence Alignment , Sequence Homology, Amino Acid , Signal Transduction/genetics
3.
BMC Evol Biol ; 8: 247, 2008 Sep 09.
Article in English | MEDLINE | ID: mdl-18782449

ABSTRACT

BACKGROUND: We describe a function-driven approach to the analysis of metabolism which takes into account the phylogenetic origin of biochemical reactions to reveal subtle lineage-specific metabolic innovations, undetectable by more traditional methods based on sequence comparison. The origins of reactions and thus entire pathways are inferred using a simple taxonomic classification scheme that describes the evolutionary course of events towards the lineage of interest. We investigate the evolutionary history of the human metabolic network extracted from a metabolic database, construct a network of interconnected pathways and classify this network according to the taxonomic categories representing eukaryotes, metazoa and vertebrates. RESULTS: It is demonstrated that lineage-specific innovations correspond to reactions and pathways associated with key phenotypic changes during evolution, such as the emergence of cellular organelles in eukaryotes, cell adhesion cascades in metazoa and the biosynthesis of complex cell-specific biomolecules in vertebrates. CONCLUSION: This phylogenetic view of metabolic networks puts gene innovations within an evolutionary context, demonstrating how the emergence of a phenotype in a lineage provides a platform for the development of specialized traits.


Subject(s)
Evolution, Molecular , Metabolic Networks and Pathways , Models, Genetic , Phylogeny , Cholesterol/metabolism , Computational Biology/methods , Databases, Genetic , Glycosphingolipids/metabolism , Glycosylation , Humans
4.
BMC Genomics ; 8: 460, 2007 Dec 14.
Article in English | MEDLINE | ID: mdl-18081932

ABSTRACT

BACKGROUND: Gene fusion detection - also known as the 'Rosetta Stone' method - involves the identification of fused composite genes in a set of reference genomes, which indicates potential interactions between its un-fused counterpart genes in query genomes. The precision of this method typically improves with an ever-increasing number of reference genomes. RESULTS: In order to explore the usefulness and scope of this approach for protein interaction prediction and generate a high-quality, non-redundant set of interacting pairs of proteins across a wide taxonomic range, we have exhaustively performed gene fusion analysis for 184 genomes using an efficient variant of a previously developed protocol. By analyzing interaction graphs and applying a threshold that limits the maximum number of possible interactions within the largest graph components, we show that we can reduce the number of implausible interactions due to the detection of promiscuous domains. With this generally applicable approach, we generate a robust set of over 2 million distinct and testable interactions encompassing 696,894 proteins in 184 species or strains, most of which have never been the subject of high-throughput experimental proteomics. We investigate the cumulative effect of increasing numbers of genomes on the fidelity and quantity of predictions, and show that, for large numbers of genomes, predictions do not become saturated but continue to grow linearly, for the majority of the species. We also examine the percentage of component (and composite) proteins with relation to the number of genes and further validate the functional categories that are highly represented in this robust set of detected genome-wide interactions. CONCLUSION: We illustrate the phylogenetic and functional diversity of gene fusion events across genomes, and their usefulness for accurate prediction of protein interaction and function.
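The core 'Rosetta Stone' idea described in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' protocol: the input format (similarity hits with alignment coordinates on the composite protein), the function name `predict_fusion_links` and the `similar_pairs` argument are all hypothetical. Two query proteins are linked when they match non-overlapping regions of the same composite protein and are not similar to each other.

```python
from itertools import combinations


def predict_fusion_links(hits, similar_pairs=frozenset()):
    """Predict interacting pairs from gene fusion evidence (illustrative sketch).

    hits: iterable of (query_id, composite_id, start, end), where start/end
          are 1-based inclusive alignment coordinates on the composite protein.
    similar_pairs: set of frozenset({id1, id2}) for query proteins that are
          similar to each other and therefore must not be linked.
    Returns a set of (query_a, query_b, composite_id) tuples.
    """
    # Group the matched regions by the composite protein they fall on.
    by_composite = {}
    for query, composite, start, end in hits:
        by_composite.setdefault(composite, []).append((query, start, end))

    links = set()
    for composite, regions in by_composite.items():
        for (q1, s1, e1), (q2, s2, e2) in combinations(regions, 2):
            if q1 == q2:
                continue
            # Number of composite residues covered by both alignments.
            overlap = min(e1, e2) - max(s1, s2) + 1
            if overlap <= 0 and frozenset((q1, q2)) not in similar_pairs:
                a, b = sorted((q1, q2))
                links.add((a, b, composite))
    return links


example_hits = [
    ("yA", "fus1", 1, 200),
    ("yB", "fus1", 210, 400),
    ("yC", "fus1", 150, 300),
]
example_links = predict_fusion_links(example_hits)
```

In this toy input only `yA` and `yB` occupy disjoint regions of the composite `fus1`, so they form the single predicted pair. The graph-based threshold for promiscuous domains mentioned in the abstract is not modelled here.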


Subject(s)
Gene Fusion , Gene Regulatory Networks , Arabidopsis/genetics , Bacterial Proteins/metabolism , Chlamydia/genetics , Genetic Variation , Genome , Phylogeny , Plant Proteins/metabolism , Protein Binding , Reproducibility of Results
5.
PLoS Comput Biol ; 3(10): 2032-42, 2007 Oct.
Article in English | MEDLINE | ID: mdl-17967053

ABSTRACT

Network analysis transcends conventional pairwise approaches to data analysis as the context of components in a network graph can be taken into account. Such approaches are increasingly being applied to genomics data, where functional linkages are used to connect genes or proteins. However, while microarray gene expression datasets are now abundant and of high quality, few approaches have been developed for analysis of such data in a network context. We present a novel approach for 3-D visualisation and analysis of transcriptional networks generated from microarray data. These networks consist of nodes representing transcripts connected by virtue of their expression profile similarity across multiple conditions. Analysing genome-wide gene transcription across 61 mouse tissues, we describe the unusual topography of the large and highly structured networks produced, and demonstrate how they can be used to visualise, cluster, and mine large datasets. This approach is fast, intuitive, and versatile, and allows the identification of biological relationships that may be missed by conventional analysis techniques. This work has been implemented in a freely available open-source application named BioLayout Express(3D).
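A network of this kind, with transcripts as nodes and edges drawn between similar expression profiles, could be built roughly as follows. The abstract does not state the similarity measure or cut-off, so the Pearson correlation and the `threshold` value used here are assumptions, and the function names are hypothetical rather than part of BioLayout Express(3D).

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no correlation signal
    return cov / (sd_x * sd_y)


def build_network(profiles, threshold=0.9):
    """Return edges (transcript_a, transcript_b, r) with r >= threshold.

    profiles: dict mapping transcript id -> list of expression values,
              one value per condition (for example, per tissue).
    threshold: assumed correlation cut-off, not taken from the source.
    """
    edges = []
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        if r >= threshold:
            edges.append((a, b, r))
    return edges


toy_profiles = {
    "t1": [1.0, 2.0, 3.0, 4.0],
    "t2": [2.0, 4.0, 6.0, 8.0],
    "t3": [4.0, 3.0, 2.0, 1.0],
}
toy_edges = build_network(toy_profiles, threshold=0.9)
```

Here `t1` and `t2` rise together and are connected, while `t3` is anti-correlated with both and stays isolated. The all-against-all comparison is quadratic in the number of transcripts, so a genome-wide dataset would need a vectorised or indexed implementation.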


Subject(s)
Computational Biology/methods , Gene Expression Profiling/methods , Gene Expression Regulation , Oligonucleotide Array Sequence Analysis/methods , Transcription, Genetic , Algorithms , Animals , Cluster Analysis , Gene Expression , Gene Regulatory Networks , Imaging, Three-Dimensional , Mice , Pattern Recognition, Automated , Software
7.
BMC Bioinformatics ; 8 Suppl 4: S3, 2007 May 22.
Article in English | MEDLINE | ID: mdl-17570146

ABSTRACT

Using a previously developed automated method for enzyme annotation, we report the re-annotation of the ENZYME database and the analysis of local error rates per class. In control experiments, we demonstrate that the method is able to correctly re-annotate 91% of all Enzyme Classification (EC) classes with high coverage (755 out of 827). Only 44 enzyme classes are found to contain false positives, while the remaining 28 enzyme classes are not represented. We also show cases where the re-annotation procedure results in partial overlaps for those few enzyme classes where a certain inconsistency might appear between homologous proteins, mostly due to function specificity. Our results allow the interactive exploration of the EC hierarchy for known enzyme families as well as putative enzyme sequences that may need to be classified within the EC hierarchy. These aspects of our framework have been incorporated into a web-server, called CORRIE, which stands for Correspondence Indicator Estimation and allows the interactive prediction of a functional class for putative enzymes from sequence alone, supported by probabilistic measures in the context of the pre-calculated Correspondence Indicators of known enzymes with the functional classes of the EC hierarchy. The CORRIE server is available at: http://www.genomes.org/services/corrie/.


Subject(s)
Algorithms , Enzymes/chemistry , Enzymes/metabolism , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Software , Amino Acid Sequence , Confidence Intervals , Data Interpretation, Statistical , Enzymes/classification , Molecular Sequence Data , Sensitivity and Specificity , Sequence Homology, Amino Acid
8.
Genome Biol ; 7(10): R89, 2006.
Article in English | MEDLINE | ID: mdl-17029626

ABSTRACT

BACKGROUND: Gene duplications have been hypothesized to be a major factor in enabling the evolution of tissue differentiation. Analyses of the expression profiles of duplicate genes in mammalian tissues have indicated that, with time, the expression patterns of duplicate genes diverge and become more tissue specific. We explored the relationship between duplication events, the time at which they took place, and both the expression breadth of the duplicated genes and the cumulative expression breadth of the gene family to which they belong. RESULTS: We show that only duplicates that arose through post-multicellularity duplication events show a tendency to become more specifically expressed, whereas such a tendency is not observed for duplicates that arose in a unicellular ancestor. Unlike the narrow expression profile of the duplicated genes, the overall expression of gene families tends to maintain a global expression pattern. CONCLUSION: The work presented here supports the view suggested by the subfunctionalization model, namely that expression divergence in different tissues, following gene duplication, promotes the retention of a gene in the genome of multicellular species. The global expression profile of the gene families suggests division of expression between family members, whose expression becomes specialized. Because specialization of expression is coupled with an increased rate of sequence divergence, it can facilitate the evolution of new, tissue-specific functions.


Subject(s)
Evolution, Molecular , Gene Duplication , Gene Expression Regulation , Proteins/genetics , Animals , Cell Differentiation , Genes, Duplicate , Kinetics , Mice , Sequence Homology, Amino Acid , Species Specificity
9.
Res Microbiol ; 157(1): 57-68, 2006.
Article in English | MEDLINE | ID: mdl-16431085

ABSTRACT

Using an algorithm for ancestral state inference of gene content, given a large number of extant genome sequences and a phylogenetic tree, we aim to reconstruct the gene content of the last universal common ancestor (LUCA), a hypothetical life form that presumably was the progenitor of the three domains of life. The method allows for gene loss, previously found to be a major factor in shaping gene content, and thus the estimate of LUCA's gene content appears to be substantially higher than that proposed previously, with a typical number of over 1000 gene families, of which more than 90% are also functionally characterized. More precisely, when only prokaryotes are considered, the number varies between 1006 and 1189 gene families while when eukaryotes are also included, this number increases to between 1344 and 1529 families depending on the underlying phylogenetic tree. Therefore, the common belief that the hypothetical genome of LUCA should resemble those of the smallest extant genomes of obligate parasites is not supported by recent advances in computational genomics. Instead, a fairly complex genome similar to those of free-living prokaryotes, with a variety of functional capabilities including metabolic transformation, information processing, membrane/transport proteins and complex regulation, shared between the three domains of life, emerges as the most likely progenitor of life on Earth, with profound repercussions for planetary exploration and exobiology.


Subject(s)
Earth, Planet , Evolution, Molecular , Exobiology , Genome , Phylogeny , Algorithms , Gene Transfer, Horizontal
10.
Nucleic Acids Res ; 33(19): 6083-9, 2005.
Article in English | MEDLINE | ID: mdl-16246909

ABSTRACT

The BioCyc database collection is a set of 160 pathway/genome databases (PGDBs) for most eukaryotic and prokaryotic species whose genomes have been completely sequenced to date. Each PGDB in the BioCyc collection describes the genome and predicted metabolic network of a single organism, inferred from the MetaCyc database, which is a reference source on metabolic pathways from multiple organisms. In addition, each bacterial PGDB includes predicted operons for the corresponding species. The BioCyc collection provides a unique resource for computational systems biology, namely global and comparative analyses of genomes and metabolic networks, and a supplement to the BioCyc resource of curated PGDBs. The Omics viewer available through the BioCyc website allows scientists to visualize combinations of gene expression, proteomics and metabolomics data on the metabolic maps of these organisms. This paper discusses the computational methodology by which the BioCyc collection has been expanded, and presents an aggregate analysis of the collection that includes the range of number of pathways present in these organisms, and the most frequently observed pathways. We seek scientists to adopt and curate individual PGDBs within the BioCyc collection. Only by harnessing the expertise of many scientists can we hope to produce biological databases that accurately reflect the depth and breadth of knowledge that the biomedical research community is producing.


Subject(s)
Databases, Genetic , Genome , Animals , Computational Biology , Genome, Archaeal , Genome, Bacterial , Genomics , Humans , Metabolism/genetics
11.
Bioinformatics ; 21(19): 3806-10, 2005 Oct 01.
Article in English | MEDLINE | ID: mdl-16216832

ABSTRACT

MOTIVATION: CoGenT++ is a data environment for computational research in comparative and functional genomics, designed to address issues of consistency, reproducibility, scalability and accessibility. DESCRIPTION: CoGenT++ facilitates the re-distribution of all fully sequenced and published genomes, storing information about species, gene names and protein sequences. We describe our scalable implementation of ProXSim, a continually updated all-against-all similarity database, which stores pairwise relationships between all genome sequences. Based on these similarities, derived databases are generated for gene fusions--AllFuse, putative orthologs--OFAM, protein families--TRIBES, phylogenetic profiles--ProfUse and phylogenetic trees. Extensions based on the CoGenT++ environment include disease gene prediction, pattern discovery, automated domain detection, genome annotation and ancestral reconstruction. CONCLUSION: CoGenT++ provides a comprehensive environment for computational genomics, accessible primarily for large-scale analyses as well as manual browsing.


Subject(s)
Chromosome Mapping/methods , Computer Graphics , Database Management Systems , Databases, Genetic , Genomics/methods , Sequence Analysis/methods , User-Computer Interface , Computational Biology/methods , Information Storage and Retrieval/methods , Software , Systems Integration
12.
Appl Bioinformatics ; 4(1): 71-4, 2005.
Article in English | MEDLINE | ID: mdl-16000016

ABSTRACT

Visualisation of biological networks is becoming a common task for the analysis of high-throughput data. These networks correspond to a wide variety of biological relationships, such as sequence similarity, metabolic pathways, gene regulatory cascades and protein interactions. We present a general approach for the representation and analysis of networks of variable type, size and complexity. The application is based on the original BioLayout program (C-language implementation of the Fruchterman-Reingold layout algorithm), entirely re-written in Java to guarantee portability across platforms. BioLayout(Java) provides broader functionality, various analysis techniques, extensions for better visualisation and a new user interface. Examples of analysis of biological networks using BioLayout(Java) are presented.


Subject(s)
Computer Graphics , Gene Expression Regulation/physiology , Proteome/chemistry , Proteome/metabolism , Signal Transduction/physiology , Software , User-Computer Interface , Programming Languages , Structure-Activity Relationship
13.
Bioinformatics ; 21(16): 3429-30, 2005 Aug 15.
Article in English | MEDLINE | ID: mdl-15961438

ABSTRACT

MOTIVATION: At present, mapping of sequence identifiers across databases is a daunting, time-consuming and computationally expensive process, usually achieved by sequence similarity searches with strict threshold values. SUMMARY: We present a rapid and efficient method to map sequence identifiers across databases. The method uses the MD5 checksum algorithm for message integrity to generate sequence fingerprints and uses these fingerprints as hash strings to map sequences across databases. The program, called MagicMatch, is able to cross-link any of the major sequence databases within a few seconds on a modest desktop computer.
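The fingerprinting idea in this abstract can be illustrated with the standard-library `hashlib` module. This is a minimal sketch, not the MagicMatch implementation: the function names, the dictionary-based input format, and the normalisation step (stripping whitespace and upper-casing before hashing) are assumptions made for the example.

```python
import hashlib


def fingerprint(sequence: str) -> str:
    """Return the MD5 hex digest of a sequence.

    Assumption: whitespace is removed and letters are upper-cased first,
    so that formatting differences do not change the fingerprint.
    """
    normalised = "".join(sequence.split()).upper()
    return hashlib.md5(normalised.encode("ascii")).hexdigest()


def map_identifiers(db_a: dict, db_b: dict) -> dict:
    """Map each identifier in db_a to the db_b identifiers with the same sequence.

    db_a, db_b: dicts mapping identifier -> sequence string.
    """
    # Index the second database by fingerprint for constant-time lookups.
    index = {}
    for ident, seq in db_b.items():
        index.setdefault(fingerprint(seq), []).append(ident)
    return {ident: index.get(fingerprint(seq), []) for ident, seq in db_a.items()}


db_a = {"A1": "MKT AYI", "A2": "GGG"}
db_b = {"B1": "mktayi", "B2": "CCC"}
mapping = map_identifiers(db_a, db_b)
```

In this toy example `A1` maps to `B1` because the two entries hold the same sequence once formatting is normalised, and `A2` has no counterpart. Only exactly identical sequences are linked; a single residue difference produces an unrelated digest, which is the trade-off against similarity searches.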


Subject(s)
Algorithms , Database Management Systems , Databases, Protein , Information Storage and Retrieval/methods , Proteins/chemistry , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Software , Amino Acid Sequence , Molecular Sequence Data , Proteins/analysis , Proteins/classification
14.
Genome Res ; 15(7): 954-9, 2005 Jul.
Article in English | MEDLINE | ID: mdl-15965028

ABSTRACT

It has previously been suggested that the phylogeny of microbial species might be better described as a network containing vertical and horizontal gene transfer (HGT) events. Yet, all phylogenetic reconstructions so far have presented microbial trees rather than networks. Here, we present a first attempt to reconstruct such an evolutionary network, which we term the "net of life". We use available tree reconstruction methods to infer vertical inheritance, and use an ancestral state inference algorithm to map HGT events on the tree. We also describe a weighting scheme used to estimate the number of genes exchanged between pairs of organisms. We demonstrate that vertical inheritance constitutes the bulk of gene transfer on the tree of life. We term the bulk of horizontal gene flow between tree nodes as "vines", and demonstrate that multiple but mostly tiny vines interconnect the tree. Our results strongly suggest that the HGT network is a scale-free graph, a finding with important implications for genome evolution. We propose that genes might propagate extremely rapidly across microbial species through the HGT network, using certain organisms as hubs.


Subject(s)
Archaea/genetics , Bacteria/genetics , Gene Transfer, Horizontal , Phylogeny , Algorithms , Computational Biology , Evolution, Molecular , Genome, Bacterial , Models, Genetic
16.
Nucleic Acids Res ; 33(2): 616-21, 2005.
Article in English | MEDLINE | ID: mdl-15681613

ABSTRACT

Species evolutionary relationships have traditionally been defined by sequence similarities of phylogenetic marker molecules, recently followed by whole-genome phylogenies based on gene order, average ortholog similarity or gene content. Here, we introduce genome conservation--a novel metric of evolutionary distances between species that simultaneously takes into account both gene content and sequence similarity at the whole-genome level. Genome conservation represents a robust distance measure, as demonstrated by accurate phylogenetic reconstructions. The genome conservation matrix for all presently sequenced organisms exhibits a remarkable ability to define evolutionary relationships across all taxonomic ranges. An assessment of taxonomic ranks with genome conservation shows that certain ranks are inadequately described and raises the possibility for a more precise and quantitative taxonomy in the future. All phylogenetic reconstructions are available at the genome phylogeny server: .


Subject(s)
Computational Biology/methods , Genomics/methods , Phylogeny , Bacteria/classification , Bacteria/genetics , Evolution, Molecular , Genome, Bacterial , Proteobacteria/classification , Proteobacteria/genetics
17.
Bioinformatics ; 19(11): 1451-2, 2003 Jul 22.
Article in English | MEDLINE | ID: mdl-12874064

ABSTRACT

SUMMARY: We present a database of fully sequenced and published genomes to facilitate the re-distribution of data and ensure reproducibility of results in the field of computational genomics. For its design we have implemented an extremely simple yet powerful schema to allow linking of genome sequence data to other resources. AVAILABILITY: http://maine.ebi.ac.uk:8000/services/cogent/


Subject(s)
Database Management Systems , Databases, Genetic , Documentation , Genomics/methods , Information Storage and Retrieval/methods , Sequence Analysis, DNA/methods , Computational Biology/methods , Internet
18.
Genome Biol ; 4(5): 402, 2003.
Article in English | MEDLINE | ID: mdl-12734008

ABSTRACT

By the end of 2002, we witnessed the landmark submission of the 100th complete genome sequence in the databases. An overview of these genomes reveals certain interesting trends and provides valuable insights into possible future developments.


Subject(s)
Genome , Animals , Computational Biology/methods , Computational Biology/trends , Humans , Phylogeny , Proteins/genetics