2.
Bioinformatics ; 24(17): 1935-41, 2008 Sep 01.
Article in English | MEDLINE | ID: mdl-18593717

ABSTRACT

MOTIVATION: Biomedical literature is the principal repository of biomedical knowledge, with PubMed being the most complete database collecting, organizing and analyzing such textual knowledge. There are numerous efforts that attempt to exploit this information by using text mining and machine learning techniques. We developed a novel approach, called PuReD-MCL (Pubmed Related Documents-MCL), which is based on the graph clustering algorithm MCL and relevant resources from PubMed. METHODS: PuReD-MCL avoids using natural language processing (NLP) techniques directly; instead, it takes advantage of existing resources, available from PubMed. PuReD-MCL then clusters documents efficiently using the MCL graph clustering algorithm, which is based on graph flow simulation. This process allows users to analyse the results by highlighting important clues, and finally to visualize the clusters and all relevant information using an interactive graph layout algorithm, for instance BioLayout Express 3D. RESULTS: The methodology was applied to two different datasets, previously used for the validation of the document clustering tool TextQuest. The first dataset involves the organisms Escherichia coli and yeast, whereas the second is related to Drosophila development. PuReD-MCL successfully reproduces the annotated results obtained from TextQuest, while at the same time providing additional insights into the clusters and the corresponding documents. AVAILABILITY: Source code in Perl and R is available from http://tartara.csd.auth.gr/~theodos/


Subject(s)
Artificial Intelligence , Cluster Analysis , Information Storage and Retrieval/methods , Natural Language Processing , Pattern Recognition, Automated/methods , PubMed , Software , Algorithms , Database Management Systems
3.
Nucleic Acids Res ; 30(7): 1575-84, 2002 Apr 01.
Article in English | MEDLINE | ID: mdl-11917018

ABSTRACT

Detection of protein families in large databases is one of the principal research objectives in structural and functional genomics. Protein family classification can significantly contribute to the delineation of functional diversity of homologous proteins, the prediction of function based on domain architecture or the presence of sequence motifs as well as comparative genomics, providing valuable evolutionary insights. We present a novel approach called TRIBE-MCL for rapid and accurate clustering of protein sequences into families. The method relies on the Markov cluster (MCL) algorithm for the assignment of proteins into families based on precomputed sequence similarity information. This novel approach does not suffer from the problems that normally hinder other protein sequence clustering algorithms, such as the presence of multi-domain proteins, promiscuous domains and fragmented proteins. The method has been rigorously tested and validated on a number of very large databases, including SwissProt, InterPro, SCOP and the draft human genome. Our results indicate that the method is ideally suited to the rapid and accurate detection of protein families on a large scale. The method has been used to detect and categorise protein families within the draft human genome and the resulting families have been used to annotate a large proportion of human proteins.


Subject(s)
Algorithms , Databases, Protein , Proteins/genetics , Amino Acid Sequence , Genome, Human , Humans , Internet , Molecular Sequence Data , Sequence Alignment , Sequence Homology, Amino Acid , Transcription Factor TFIIB , Transcription Factors/genetics
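
The Markov cluster (MCL) procedure named in the abstract above can be illustrated with a minimal sketch. This is a textbook-style version written in pure Python for a small dense matrix, not the TRIBE-MCL implementation: the function names, the inflation value of 2.0, the added self-loops, the convergence tolerance and the way clusters are read off the final matrix are all assumptions made for illustration.

```python
def normalize_columns(m):
    """Rescale each column so it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain dense matrix product for square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(adjacency, inflation=2.0, iterations=50, tol=1e-9):
    """Cluster a similarity graph by alternating expansion and inflation.

    adjacency: square list of lists of non-negative similarity weights.
    Returns a sorted list of clusters, each a sorted list of node indices.
    """
    n = len(adjacency)
    # Self-loops (an assumption of this sketch) keep every column non-empty.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        expanded = matmul(m, m)  # expansion: flow spreads along paths
        inflated = normalize_columns(
            [[v ** inflation for v in row] for row in expanded]
        )  # inflation: strong flows are boosted, weak flows decay
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Each row that still carries flow lists the nodes attracted to it.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)
```

In a protein-family setting the input weights would come from precomputed sequence similarity scores, as the abstract describes; how those scores are transformed into weights is not specified here and is left to the caller.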
4.
Nucleic Acids Res ; 29(21): 4395-404, 2001 Nov 01.
Article in English | MEDLINE | ID: mdl-11691927

ABSTRACT

Whole-genome clustering of the two available genome sequences of Helicobacter pylori strains 26695 and J99 allows the detection of 110 and 52 strain-specific genes, respectively. This set of strain-specific genes was compared with the sets obtained with other computational approaches of direct genome comparison as well as experimental data from microarray analysis. A considerable number of novel function assignments is possible using database-driven sequence annotation, although the function of the majority of the identified genes remains unknown. Using whole-genome clustering, it is also possible to detect species-specific genes by comparing the two H.pylori strains against the genome sequence of Campylobacter jejuni. It is interesting that the majority of strain-specific genes appear to be species specific. Finally, we introduce a novel approach to gene position analysis by employing measures from directional statistics. We show that although the two strains exhibit differences with respect to strain-specific gene distributions, this is due to the extensive genome rearrangements. If these are taken into account, a common pattern for the genome dynamics of the two Helicobacter strains emerges, suggestive of certain spatial constraints that may act as control mechanisms of gene flux.


Subject(s)
Evolution, Molecular , Genes, Bacterial/genetics , Genome, Bacterial , Genomics , Helicobacter pylori/classification , Helicobacter pylori/genetics , Amino Acid Sequence , Bacterial Proteins/chemistry , Bacterial Proteins/classification , Bacterial Proteins/genetics , Bacterial Proteins/metabolism , Campylobacter jejuni/genetics , Computational Biology , Databases, Protein , Gene Order/genetics , Internet , Models, Genetic , Molecular Sequence Data , Sequence Alignment , Species Specificity
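
The abstract above mentions gene position analysis with measures from directional statistics but does not say which measures were used. As a hedged illustration only, the sketch below computes two standard ones, the circular mean and the mean resultant length, for gene positions on a circular genome. The function name and the choice of these two measures are assumptions.

```python
import math


def circular_summary(positions, genome_length):
    """Summarize gene positions on a circular genome.

    positions: gene coordinates in base pairs (0 <= p < genome_length).
    Returns (mean_position, resultant_length), where resultant_length is
    near 1 for tightly clustered genes and near 0 for evenly spread genes.
    """
    if not positions:
        raise ValueError("at least one position is required")
    # Map each coordinate to an angle on the unit circle.
    angles = [2 * math.pi * p / genome_length for p in positions]
    c = sum(math.cos(a) for a in angles) / len(angles)
    s = sum(math.sin(a) for a in angles) / len(angles)
    resultant_length = math.hypot(c, s)
    mean_angle = math.atan2(s, c) % (2 * math.pi)
    mean_position = mean_angle / (2 * math.pi) * genome_length
    return mean_position, resultant_length
```

Treating positions as angles matters near the origin of the coordinate system: genes at 950 and 50 on a 1,000-unit circle are close neighbours, which an ordinary arithmetic mean of 500 would misrepresent.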
5.
Bioinformatics ; 17(9): 853-4, 2001 Sep.
Article in English | MEDLINE | ID: mdl-11590107

ABSTRACT

UNLABELLED: Graph layout is extensively used in mathematics and computer science; however, these ideas and methods have not been extended in a general fashion to the construction of graphs for biological data. To this end, we have implemented a version of the Fruchterman-Rheingold graph layout algorithm, extensively modified for the purpose of similarity analysis in biology. This algorithm rapidly and effectively generates clear two-dimensional (2D) or three-dimensional (3D) graphs representing similarity relationships such as protein sequence similarity. The implementation of the algorithm is general and applicable to most types of similarity information for biological data. AVAILABILITY: BioLayout is available for most UNIX platforms at the following web-site: http://www.ebi.ac.uk/research/cgg/services/layout.


Subject(s)
Algorithms , Computer Graphics , Amino Acid Sequence , Computer Graphics/statistics & numerical data , Computer Graphics/trends , Databases, Protein/statistics & numerical data , Databases, Protein/trends , Image Processing, Computer-Assisted/statistics & numerical data , Image Processing, Computer-Assisted/trends , Imaging, Three-Dimensional/statistics & numerical data , Imaging, Three-Dimensional/trends , Software/statistics & numerical data , Software/trends
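
For readers unfamiliar with the Fruchterman-Rheingold layout mentioned above, here is a minimal 2D sketch of the unmodified textbook algorithm. It does not reproduce the modifications made in BioLayout, which the abstract does not detail; the function name, the square frame, the linear cooling schedule and the default parameters are assumptions.

```python
import math
import random


def fruchterman_rheingold(n_nodes, edges, iterations=100, size=1.0, seed=0):
    """Return [x, y] positions for n_nodes (>= 1) inside a size x size frame.

    edges: iterable of (i, j) index pairs for similar items.
    """
    rng = random.Random(seed)
    pos = [[rng.uniform(0, size), rng.uniform(0, size)] for _ in range(n_nodes)]
    k = size / math.sqrt(n_nodes)      # ideal distance: sqrt(area / n)
    temperature = size / 10            # caps how far a node moves per step
    cooling = temperature / (iterations + 1)
    for _ in range(iterations):
        disp = [[0.0, 0.0] for _ in range(n_nodes)]
        # Repulsion between every pair of nodes: k^2 / distance.
        for i in range(n_nodes):
            for j in range(i + 1, n_nodes):
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                dist = math.hypot(dx, dy) or 1e-9
                force = k * k / dist
                disp[i][0] += dx / dist * force
                disp[i][1] += dy / dist * force
                disp[j][0] -= dx / dist * force
                disp[j][1] -= dy / dist * force
        # Attraction along edges: distance^2 / k.
        for i, j in edges:
            dx = pos[i][0] - pos[j][0]
            dy = pos[i][1] - pos[j][1]
            dist = math.hypot(dx, dy) or 1e-9
            force = dist * dist / k
            disp[i][0] -= dx / dist * force
            disp[i][1] -= dy / dist * force
            disp[j][0] += dx / dist * force
            disp[j][1] += dy / dist * force
        # Move each node, limited by the temperature, and keep it in the frame.
        for i in range(n_nodes):
            length = math.hypot(disp[i][0], disp[i][1]) or 1e-9
            step = min(length, temperature)
            pos[i][0] = min(size, max(0.0, pos[i][0] + disp[i][0] / length * step))
            pos[i][1] = min(size, max(0.0, pos[i][1] + disp[i][1] / length * step))
        temperature -= cooling
    return pos
```

A 3D variant would add a z coordinate to every position and displacement; the force formulas stay the same apart from the choice of ideal distance.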
6.
Genome Res ; 11(9): 1503-10, 2001 Sep.
Article in English | MEDLINE | ID: mdl-11544193

ABSTRACT

We have analyzed the known metabolic enzymes of Escherichia coli in relation to their biochemical reaction properties and their involvement in biochemical pathways. All enzymes involved in small-molecule metabolism and their corresponding protein sequences have been extracted from the EcoCyc database. These 548 metabolic enzymes are clustered into 405 protein families according to sequence similarity. In this study, we examine the functional versatility within enzyme families in terms of their reaction capabilities and pathway participation. In addition, we examine the molecular diversity of reactions and pathways according to their presence across enzyme families. These complex, many-to-many relationships between protein sequence and biochemical function reveal a significant degree of correlation between enzyme families and reactions. Pathways, however, appear to require more than one enzyme type to perform their complex biochemical transformations. Finally, the distribution of enzyme family members across different pathways provides support for the "recruitment" hypothesis of biochemical pathway evolution.


Subject(s)
Enzymes/physiology , Escherichia coli/enzymology , Escherichia coli/genetics , Multigene Family , Amino Acid Sequence , Computational Biology , Databases, Factual , Enzymes/genetics , Enzymes/metabolism , Genetic Variation , Molecular Sequence Data , Sequence Alignment , Structure-Activity Relationship
7.
Pac Symp Biocomput ; : 384-95, 2001.
Article in English | MEDLINE | ID: mdl-11262957

ABSTRACT

We present an algorithm for large-scale document clustering of biological text, obtained from Medline abstracts. The algorithm is based on statistical treatment of terms, stemming, the idea of a 'go-list', unsupervised machine learning and graph layout optimization. The method is flexible and robust, controlled by a small number of parameter values. Experiments show that the resulting document clusters are meaningful as assessed by cluster-specific terms. Despite the statistical nature of the approach, with minimal semantic analysis, the terms provide a shallow description of the document corpus and support concept discovery.


Subject(s)
Abstracting and Indexing , Algorithms , MEDLINE , Molecular Biology , Animals , Artificial Intelligence , Cluster Analysis , Drosophila/embryology , Terminology as Topic
8.
Nucleic Acids Res ; 29(7): 1608-15, 2001 Apr 01.
Article in English | MEDLINE | ID: mdl-11266564

ABSTRACT

The global amino acid compositions as deduced from the complete genomic sequences of six thermophilic archaea, two thermophilic bacteria, 17 mesophilic bacteria and two eukaryotic species were analysed by hierarchical clustering and principal components analysis. Both methods showed an influence of several factors on amino acid composition. Although GC content has a dominant effect, thermophilic species can be identified by their global amino acid compositions alone. This study presents a careful statistical analysis of factors that affect amino acid composition and also yielded specific features of the average amino acid composition of thermophilic species. Moreover, we introduce the first example of a 'compositional tree' of species that takes into account not only homologous proteins, but also proteins unique to particular species. We expect this simple yet novel approach to be a useful additional tool for the study of phylogeny at the genome level.


Subject(s)
Amino Acids/genetics , Genome , Amino Acids/chemistry , Animals , Archaea/genetics , Bacteria/genetics , Caenorhabditis elegans/genetics , Databases, Factual , Genome, Archaeal , Genome, Bacterial , Genome, Fungal , Phylogeny , Saccharomyces cerevisiae/genetics , Species Specificity , Temperature
9.
Genome Biol ; 2(1): INTERACTIONS0001, 2001.
Article in English | MEDLINE | ID: mdl-11178275

ABSTRACT

To assess how automatic function assignment will contribute to genome annotation in the next five years, we have performed an analysis of 31 available genome sequences. An emerging pattern is that function can be predicted for almost two-thirds of the 73,500 genes that were analyzed. Despite progress in computational biology, there will always be a great need for large-scale experimental determination of protein function.


Subject(s)
Genome , Sequence Analysis, DNA , Animals , Genome, Human , Genomics/methods , Genomics/trends , Humans , Proteome , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/trends
10.
Bioinformatics ; 17(1): 95-7, 2001 Jan.
Article in English | MEDLINE | ID: mdl-11222266

ABSTRACT

The mechanisms controlling gene regulation appear to be fundamentally different in eukaryotes and prokaryotes (Struhl (1999) Cell, 98, 1-4). To investigate this diversity further, we have analysed the distribution of all known transcription-associated proteins (TAPs), as reflected by sequence database annotations. Our results for the primary phylogenetic domains (Archaea, Bacteria and Eukaryota) show that TAP families are mostly taxon-specific and very few transcriptional regulators are common across these domains.


Subject(s)
Computational Biology , Proteins/genetics , Databases, Factual , Phylogeny , Proteins/classification , Transcription Factors/classification , Transcription Factors/genetics , Transcription, Genetic
11.
Genome Biol ; 2(9): RESEARCH0034, 2001.
Article in English | MEDLINE | ID: mdl-11820254

ABSTRACT

BACKGROUND: It has recently been shown that the detection of gene fusion events across genomes can be used for predicting functional associations of proteins, including physical interaction or complex formation. To obtain such predictions we have made an exhaustive search for gene fusion events within 24 available completely sequenced genomes. RESULTS: Each genome was used as a query against the remaining 23 complete genomes to detect gene fusion events. Using an improved, fully automatic protocol, a total of 7,224 single-domain proteins that are components of gene fusions in other genomes were detected, many of which were identified for the first time. The total number of predicted pairwise functional associations is 39,730 for all genomes. Component pairs were identified by virtue of their similarity to 2,365 multidomain composite proteins. We also show for the first time that gene fusion is a complex evolutionary process with a number of contributory factors, including paralogy, genome size and phylogenetic distance. On average, 9% of genes in a given genome appear to code for single-domain, component proteins predicted to be functionally associated. These proteins are detected by an additional 4% of genes that code for fused, composite proteins. CONCLUSIONS: These results provide an exhaustive set of functionally associated genes and also delineate the power of fusion analysis for the prediction of protein interactions.


Subject(s)
Artificial Gene Fusion , Evolution, Molecular , Genome , Proteins/genetics , Proteins/metabolism , Recombination, Genetic/genetics , Algorithms , Animals , Bacterial Proteins/genetics , Bacterial Proteins/metabolism , Caenorhabditis elegans Proteins/genetics , Caenorhabditis elegans Proteins/metabolism , Computational Biology/methods , Drosophila Proteins/genetics , Drosophila Proteins/metabolism , Fungal Proteins/genetics , Fungal Proteins/metabolism , Gene Expression Profiling , Multigene Family/genetics , Phylogeny , Protein Binding , Recombinant Fusion Proteins/genetics , Recombinant Proteins/genetics , Reproducibility of Results , Two-Hybrid System Techniques
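
The core idea in the abstract above, pairing component proteins through their shared similarity to a composite protein, can be sketched as follows. This is a simplified illustration, not the authors' automatic protocol: the hit tuple layout, the requirement that the two components align to non-overlapping regions of the composite, and the exclusion of components that are similar to each other are assumptions of this sketch.

```python
from itertools import combinations


def find_fusion_pairs(hits, similar_pairs=frozenset()):
    """Predict component pairs linked by a composite (fused) protein.

    hits: iterable of (component_id, composite_id, start, end), giving the
          region of the composite that the component aligns to (end exclusive).
    similar_pairs: set of frozenset({a, b}) for components that are similar
          to each other and therefore should not be reported as a pair.
    Returns a sorted list of (component_a, component_b, composite_id).
    """
    by_composite = {}
    for component, composite, start, end in hits:
        by_composite.setdefault(composite, []).append((component, start, end))

    pairs = set()
    for composite, group in by_composite.items():
        for (a, a_start, a_end), (b, b_start, b_end) in combinations(group, 2):
            if a == b:
                continue
            if frozenset((a, b)) in similar_pairs:
                continue
            # Keep the pair only if the two aligned regions do not overlap.
            if a_end <= b_start or b_end <= a_start:
                pairs.add((min(a, b), max(a, b), composite))
    return sorted(pairs)
```

Each reported triple corresponds to one predicted pairwise functional association; the input hits would come from a sequence similarity search of one genome against the others.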
12.
RNA ; 7(12): 1693-701, 2001 Dec.
Article in English | MEDLINE | ID: mdl-11780626

ABSTRACT

Domains rich in alternating arginine and serine residues (RS domains) are frequently found in metazoan proteins involved in pre-mRNA splicing. The RS domains of splicing factors associate with each other and are important for the formation of protein-protein interactions required for both constitutive and regulated splicing. The prevalence of the RS domain in splicing factors suggests that it might serve as a useful signature for the identification of new proteins that function in pre-mRNA processing, although it remains to be determined whether RS domains also participate in other cellular functions. Using database search and sequence clustering methods, we have identified and categorized RS domain proteins encoded within the entire genomes of Homo sapiens, Drosophila melanogaster, Caenorhabditis elegans, and Saccharomyces cerevisiae. This genome-wide survey revealed a surprising complexity of RS domain proteins in metazoans with functions associated with chromatin structure, transcription by RNA polymerase II, cell cycle, and cell structure, as well as pre-mRNA processing. Also identified were RS domain proteins in S. cerevisiae with functions associated with cell structure, osmotic regulation, and cell cycle progression. The results thus demonstrate an effective strategy for the genomic mining of RS domain proteins. The identification of many new proteins using this strategy has provided a database of factors that are candidates for forming RS domain-mediated interactions associated with different steps in pre-mRNA processing, in addition to other cellular functions.


Subject(s)
Amino Acid Motifs/genetics , Computational Biology/methods , Molecular Biology/methods , Protein Structure, Tertiary/genetics , Animals , Arginine/genetics , Caenorhabditis elegans/genetics , Cell Cycle , Chromatin/metabolism , Drosophila melanogaster/genetics , Evolution, Molecular , Genome , Humans , Phosphoprotein Phosphatases , Protein Kinases , RNA Polymerase II/metabolism , RNA Processing, Post-Transcriptional , Research Design , Saccharomyces cerevisiae/genetics , Serine/genetics , Transcription, Genetic
13.
Bioinformatics ; 16(10): 915-22, 2000 Oct.
Article in English | MEDLINE | ID: mdl-11120681

ABSTRACT

MOTIVATION: Sensitive detection and masking of low-complexity regions in protein sequences. Filtered sequences can be used in sequence comparison without the risk of matching compositionally biased regions. The main advantage of the method over similar approaches is the selective masking of single residue types without affecting other, possibly important, regions. RESULTS: A novel algorithm for low-complexity region detection and selective masking. The algorithm is based on multiple-pass Smith-Waterman comparison of the query sequence against twenty homopolymers with infinite gap penalties. The output of the algorithm is both the masked query sequence for further analysis, e.g. database searches, as well as the regions of low complexity. The detection of low-complexity regions is highly specific for single residue types. It is shown that this approach is sufficient for masking database query sequences without generating false positives. The algorithm is benchmarked against widely available algorithms using the 210 genes of Plasmodium falciparum chromosome 2, a dataset known to contain a large number of low-complexity regions. AVAILABILITY: CAST (version 1.0) executable binaries are available to academic users free of charge under license. Web site entry point, server and additional material: http://www.ebi.ac.uk/research/cgg/services/cast/


Subject(s)
Algorithms , DNA, Protozoan/chemistry , Plasmodium falciparum/genetics , Sequence Analysis, DNA/methods , Animals , DNA, Protozoan/genetics , Databases, Factual , Genes, Protozoan , Open Reading Frames
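
With infinite gap penalties, a local alignment of a query against a homopolymer reduces to finding the best-scoring contiguous segment of the query for that residue type. The sketch below illustrates that reduction together with multiple-pass selective masking. It is not the CAST program: the +1/-1 scoring (a stand-in for a substitution matrix), the threshold of 5, the mask character and the function names are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def best_segment(seq, residue, match=1, mismatch=-1):
    """Best ungapped local alignment of seq against a homopolymer of residue.

    Returns (score, start, end) with end exclusive; score 0 means no match.
    """
    best = (0, 0, 0)
    score, start = 0, 0
    for i, ch in enumerate(seq):
        if score <= 0:           # restart the local alignment here
            score, start = 0, i
        score += match if ch == residue else mismatch
        if score > best[0]:
            best = (score, start, i + 1)
    return best


def mask_low_complexity(seq, threshold=5, mask_char="X"):
    """Repeatedly mask the highest-scoring single-residue region.

    Only residues of the biased type are masked; other residues inside the
    region are left untouched. Returns (masked_sequence, regions), where each
    region is (residue, start, end, score).
    """
    seq = list(seq)
    regions = []
    while True:
        candidates = [(best_segment(seq, aa), aa) for aa in AMINO_ACIDS]
        (score, start, end), aa = max(candidates)
        if score < threshold:
            break
        regions.append((aa, start, end, score))
        for i in range(start, end):
            if seq[i] == aa:
                seq[i] = mask_char
    return "".join(seq), regions
```

For example, `mask_low_complexity("QQQAQQQQ")` masks the glutamine residues and keeps the alanine, giving `"XXXAXXXX"`, which mirrors the selective masking of a single residue type described in the abstract.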
14.
Nucleic Acids Res ; 28(22): 4573-6, 2000 Nov 15.
Article in English | MEDLINE | ID: mdl-11071948

ABSTRACT

The proliferation of genome sequence data has led to the development of a number of tools and strategies that facilitate computational analysis. These methods include the identification of motif patterns, membership of the query sequences in family databases, metabolic pathway involvement and gene proximity. We re-examined the completely sequenced genome of Thermotoga maritima by employing the combined use of the above methods. By analyzing all 1877 proteins encoded in this genome, we identified 193 cases of conflicting annotations (10%), of which 164 are new function predictions and 29 are amendments of previously proposed assignments. These results suggest that the combined use of existing computational tools can resolve inconclusive sequence similarities and significantly improve the prediction of protein function from genome sequence.


Subject(s)
Genome, Bacterial , Sequence Alignment/methods , Thermotoga maritima/genetics , Computational Biology , Genes, Bacterial/genetics , Open Reading Frames , Sequence Analysis
16.
FEBS Lett ; 480(1): 42-8, 2000 Aug 25.
Article in English | MEDLINE | ID: mdl-10967327

ABSTRACT

Computational genomics is a subfield of computational biology that deals with the analysis of entire genome sequences. Transcending the boundaries of classical sequence analysis, computational genomics exploits the inherent properties of entire genomes by modelling them as systems. We review recent developments in the field, discuss in some detail a number of novel approaches that take into account the genomic context and argue that progress will be made by novel knowledge representation and simulation technologies.


Subject(s)
Computational Biology/methods , Computational Biology/trends , Genes , Genome , Animals , Computer Simulation , Databases as Topic , Genes/genetics , Genes/physiology , Humans , Multigene Family/genetics , Recombinant Fusion Proteins/genetics , Sequence Alignment
17.
Pac Symp Biocomput ; : 541-52, 2000.
Article in English | MEDLINE | ID: mdl-10902201

ABSTRACT

This paper motivates the use of Information Extraction (IE) for gathering data on protein interactions, describes the customization of an existing IE system, SRI's Highlight, for this task and presents the results of an experiment on unseen Medline abstracts which show that customization to a new domain can be fast, reliable and cost-effective.


Subject(s)
Information Storage and Retrieval , MEDLINE , Proteins/metabolism , Abstracting and Indexing , Language , Linguistics
18.
Bioinformatics ; 16(5): 451-7, 2000 May.
Article in English | MEDLINE | ID: mdl-10871267

ABSTRACT

MOTIVATION: Efficient, accurate and automatic clustering of large protein sequence datasets, such as complete proteomes, into families, according to sequence similarity. Detection and correction of false positive and negative relationships with subsequent detection and resolution of multi-domain proteins. RESULTS: A new algorithm for the automatic clustering of protein sequence datasets has been developed. This algorithm represents all similarity relationships within the dataset in a binary matrix. Removal of false positives is achieved through subsequent symmetrification of the matrix using a Smith-Waterman dynamic programming alignment algorithm. Detection of multi-domain protein families and further false positive relationships within the symmetrical matrix is achieved through iterative processing of matrix elements with successive rounds of Smith-Waterman dynamic programming alignments. Recursive single-linkage clustering of the corrected matrix allows efficient and accurate family representation for each protein in the dataset. Initial clusters containing multi-domain families are split into their constituent clusters using the information obtained by the multi-domain detection step. This algorithm can hence quickly and accurately cluster large protein datasets into families. Problems due to the presence of multi-domain proteins are minimized, allowing more precise clustering information to be obtained automatically. AVAILABILITY: GeneRAGE (version 1.0) executable binaries for most platforms may be obtained from the authors on request. The system is available to academic users free of charge under license.


Subject(s)
Algorithms , Proteins/chemistry , Proteins/genetics , Sequence Alignment/methods , Amino Acid Sequence , Bacterial Proteins/chemistry , Bacterial Proteins/genetics , Cluster Analysis , Databases, Factual , Fungal Proteins/chemistry , Fungal Proteins/genetics , Genome, Bacterial , Genome, Fungal , Protein Structure, Tertiary , Sequence Alignment/statistics & numerical data
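
Two of the steps described above, symmetrification of the binary similarity matrix and single-linkage clustering, can be sketched in a few lines. This is an illustration rather than the GeneRAGE code: the `confirm` callback is a hypothetical stand-in for the Smith-Waterman check on one-directional hits, the clustering is written iteratively rather than recursively, and the multi-domain detection step is omitted.

```python
def symmetrify(matrix, confirm):
    """Resolve one-directional hits in a binary similarity matrix.

    matrix: square list of lists of 0/1, matrix[i][j] == 1 if i hits j.
    confirm: callable (i, j) -> bool deciding whether an asymmetric
             relationship is real (hypothetical alignment check).
    Returns a new symmetric matrix.
    """
    n = len(matrix)
    result = [row[:] for row in matrix]
    for i in range(n):
        for j in range(i + 1, n):
            if result[i][j] != result[j][i]:
                keep = 1 if confirm(i, j) else 0
                result[i][j] = result[j][i] = keep
    return result


def single_linkage(matrix):
    """Group sequences into families: connected components of the matrix."""
    n = len(matrix)
    seen = [False] * n
    clusters = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack, component = [start], []
        while stack:
            i = stack.pop()
            component.append(i)
            for j in range(n):
                if matrix[i][j] and not seen[j]:
                    seen[j] = True
                    stack.append(j)
        clusters.append(sorted(component))
    return clusters
```

Because single linkage joins any two sequences connected by a chain of hits, a single false positive can merge two families, which is why the abstract places the false-positive removal before the clustering step.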
20.
Genome Res ; 10(4): 568-76, 2000 Apr.
Article in English | MEDLINE | ID: mdl-10779499

ABSTRACT

The EcoCyc database characterizes the known network of Escherichia coli small-molecule metabolism. Here we present a computational analysis of the global properties of that network, which consists of 744 reactions that are catalyzed by 607 enzymes. The reactions are organized into 131 pathways. Of the metabolic enzymes, 100 are multifunctional, and 68 of the reactions are catalyzed by >1 enzyme. The network contains 791 chemical substrates. Other properties considered by the analysis include the distribution of enzyme subunit organization, and the distribution of modulators of enzyme activity and of enzyme cofactors. The dimensions chosen for this analysis can be employed for comparative functional analysis of complete genomes.


Subject(s)
Escherichia coli/metabolism , Catalysis , Computational Biology/methods , Databases, Factual , Enzyme Activation/genetics , Escherichia coli/enzymology , Escherichia coli/genetics , Genome, Bacterial , Multienzyme Complexes/genetics