Results 1 - 11 of 11
1.
J Theor Biol ; 256(2): 215-26, 2009 Jan 21.
Article in English | MEDLINE | ID: mdl-18977232

ABSTRACT

We present a thorough analysis of the relation between amino acid sequence and local three-dimensional structure in proteins. A library of overlapping local structural prototypes was built using an unsupervised clustering approach called "hybrid protein model" (HPM). The HPM carries out a multiple structural alignment of local folds from a non-redundant protein structure databank encoded into a structural alphabet composed of 16 protein blocks (PBs). Following previous research focusing on the HPM protocol, we have considered gaps in the local structure prototype. This methodology allows variable-length fragments. Hence, 120 local structure prototypes were obtained. Twenty-five percent of the protein fragments learnt by HPM had gaps. An investigation of tight turns suggested that they are mainly derived from three PB series with precise locations in the HPM. The amino acid information content of the whole conformational classes was tackled by multivariate methods, e.g., canonical correlation analysis. It points out the presence of seven amino acid equivalence classes showing high propensities for preferential local structures. In the same way, the definition of "contrast factors" based on sequence-structure properties underlines the specificity of certain structural prototypes, e.g., the dependence of Gly- or Asn-rich turns on a limited number of PBs, or the opposition between Pro-rich coils and those enriched in Ser, Thr, Asn and Glu. These results are useful for analyzing sequence-structure relationships, and could also be used to improve fragment-based methods for protein structure prediction from sequence.


Subject(s)
Amino Acid Sequence , Protein Structure, Secondary , Amino Acid Motifs , Computational Biology/methods , Models, Molecular , Peptide Library , Sequence Analysis, Protein/methods
2.
Bioinformatics ; 22(11): 1359-66, 2006 Jun 01.
Article in English | MEDLINE | ID: mdl-16527831

ABSTRACT

MOTIVATION: Molecular evolution, which is classically assessed by comparison of individual proteins or genes between species, can now be studied by comparing co-expressed functional groups of genes. This approach, which better reflects the functional constraints on the evolution of organisms, can exploit the large amount of data generated by genome-wide expression analyses. However, it requires new methodologies to represent the data in a more accessible way for cross-species comparisons. RESULTS: In this work, we present an approach based on Multi-dimensional Scaling techniques, to compare the conformation of two gene expression networks, represented in a multi-dimensional space. The expression networks are optimally superimposed, taking into account two criteria: (1) inter-organism orthologous gene pairs have to be nearby points in the final multi-dimensional space and (2) the distortion of the gene expression networks, the organization of which reflects the similarities between the gene expression measurements, has to be circumscribed. Using this approach, we compared the transcriptional programs that drive sporulation in budding and fission yeasts, extracting some common properties and differences between the two species.


Subject(s)
Computational Biology/methods , Gene Expression Profiling/methods , Evolution, Molecular , Fungal Proteins/chemistry , Gene Expression Regulation, Fungal , Genome , Models, Statistical , Saccharomyces cerevisiae/metabolism , Schizosaccharomyces/metabolism , Software , Species Specificity
3.
Proteins ; 62(4): 865-80, 2006 Mar 01.
Article in English | MEDLINE | ID: mdl-16385557

ABSTRACT

We developed a novel approach for predicting local protein structure from sequence. It relies on the Hybrid Protein Model (HPM), an unsupervised clustering method we previously developed. This model learns three-dimensional protein fragments encoded into a structural alphabet of 16 protein blocks (PBs). Here, we focused on 11-residue fragments encoded as a series of seven PBs and used HPM to cluster them according to their local similarities. We thus built a library of 120 overlapping prototypes (mean fragments from each cluster), with good three-dimensional local approximation, i.e., a mean accuracy of 1.61 Å Calpha root-mean-square distance. Our prediction method is intended to optimize the exploitation of the sequence-structure relations deduced from this library of long protein fragments. This was achieved by setting up a system of 120 experts, each defined by logistic regression to optimize the discrimination from sequence of a given prototype relative to the others. For a target sequence window, the experts computed probabilities of sequence-structure compatibility for the prototypes and ranked them, proposing the top scorers as structural candidates. Predictions were defined as successful when a prototype <2.5 Å from the true local structure was found among those proposed. Our strategy yielded a prediction rate of 51.2% for an average of 4.2 candidates per sequence window. We also proposed a confidence index to estimate prediction quality. Our approach predicts from sequence alone and will thus provide valuable information for proteins without structural homologs. Candidates will also contribute to global structure prediction by fragment assembly.
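
The expert-ranking step described in this abstract can be illustrated with a minimal sketch. It assumes that each expert is a plain logistic model over a numeric feature vector derived from the sequence window; the weights, feature encoding and prototype names below are hypothetical toy values, not those of the published method.

```python
import math

def expert_probability(weights, bias, features):
    """Logistic model: probability that the window is compatible with one prototype."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def rank_prototypes(experts, features, top_n):
    """Score the window with every expert and return the top_n (name, probability) pairs."""
    scored = [
        (name, expert_probability(weights, bias, features))
        for name, (weights, bias) in experts.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]

# Hypothetical experts: prototype name -> (weights, bias). Toy values only.
toy_experts = {
    "proto_A": ([1.0, -0.5], 0.0),
    "proto_B": ([-1.0, 0.5], 0.0),
    "proto_C": ([0.2, 0.2], -1.0),
}

# Hypothetical two-dimensional encoding of one sequence window.
candidates = rank_prototypes(toy_experts, [1.0, 0.0], top_n=2)
```

A prediction would then count as successful if any of the returned candidates lies within the distance threshold of the true local structure.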


Subject(s)
Proteins/chemistry , Amino Acid Sequence , Models, Molecular , Models, Theoretical , Peptide Library , Protein Conformation , Protein Multimerization , Software
4.
Bioinformatics ; 22(2): 129-33, 2006 Jan 15.
Article in English | MEDLINE | ID: mdl-16301202

ABSTRACT

MOTIVATION: The object of this study is to propose a new method to identify small compact units that compose protein three-dimensional structures. These fragments, called 'protein units (PU)', are a new level of description to better understand and analyze the organization of protein structures. The method only works from the contact probability matrix, i.e. the inter Calpha-distances translated into probabilities. It uses the principle of conventional hierarchical clustering, leading to a series of nested partitions of the 3D structure. Every step aims at optimally dividing a unit into 2 or 3 subunits according to a criterion called 'partition index' assessing the structural independence of the newly defined subunits. Moreover, an entropy-derived squared correlation R is used for assessing globally the protein structure dissection. The method is compared to other splitting algorithms and shows relevant performance. AVAILABILITY: An Internet server with dedicated tools is available at http://www.ebgm.jussieu.fr/~gelly/
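
As an illustration of how inter-Calpha distances might be translated into contact probabilities, the sketch below uses a logistic transform. The functional form and the parameters `d0` and `delta` are assumptions chosen for the example; the abstract does not state the formula actually used.

```python
import math

def contact_probability(distance, d0=8.0, delta=1.5):
    """Map a distance (in Å) to a probability in (0, 1); 0.5 at distance == d0.

    d0 and delta are illustrative assumptions, not values given in the abstract.
    """
    return 1.0 / (1.0 + math.exp((distance - d0) / delta))

def contact_matrix(ca_coords, d0=8.0, delta=1.5):
    """Build the symmetric contact probability matrix from Calpha coordinates."""
    n = len(ca_coords)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d = math.dist(ca_coords[i], ca_coords[j])
            matrix[i][j] = contact_probability(d, d0, delta)
    return matrix

# Three toy Calpha positions (Å): two close neighbours and one distant residue.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_matrix = contact_matrix(toy_coords)
```

Close residue pairs get probabilities near 1 and distant pairs near 0, which is the kind of matrix a hierarchical splitting procedure could then operate on.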


Subject(s)
Algorithms , Models, Chemical , Models, Molecular , Peptide Fragments/chemistry , Peptide Mapping/methods , Proteins/chemistry , Sequence Analysis, Protein/methods , Amino Acid Sequence , Computer Simulation , Models, Statistical , Molecular Sequence Data , Peptide Fragments/analysis , Protein Conformation , Protein Folding , Protein Subunits , Proteins/analysis , Software
5.
Proteins ; 59(4): 810-27, 2005 Jun 01.
Article in English | MEDLINE | ID: mdl-15822101

ABSTRACT

Three-dimensional protein structures can be described with a library of 3D fragments that define a structural alphabet. We have previously proposed such an alphabet, composed of 16 patterns of five consecutive amino acids, called Protein Blocks (PBs). These PBs have been used to describe protein backbones and to predict local structures from protein sequences. The Q16 prediction rate reaches 40.7% with an optimization procedure. This article examines two aspects of PBs. First, we determine the effect of the enlargement of databanks on their definition. The results show that the geometrical features of the different PBs are preserved (local RMSD value equal to 0.41 Å on average) and sequence-structure specificities reinforced when databanks are enlarged. Second, we improve the methods for optimizing PB predictions from sequences, revisiting the optimization procedure and exploring different local prediction strategies. Use of a statistical optimization procedure for the sequence-local structure relation improves prediction accuracy by 8% (Q16 = 48.7%). Better recognition of repetitive structures occurs without losing the prediction efficiency of the other local folds. Adding secondary structure prediction improved the accuracy of Q16 by only 1%. An entropy index (Neq), strongly related to the RMSD value of the difference between predicted PBs and true local structures, is proposed to estimate prediction quality. The Neq is linearly correlated with the Q16 prediction rate distributions, computed for a large set of proteins. An "expected" prediction rate QE16 is deduced with a mean error of 5%.
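
The entropy index can be sketched as follows, assuming Neq is the exponential of the Shannon entropy of the 16 predicted PB probabilities at a position (an "equivalent number" of PBs). The abstract does not give the formula, so this definition is an assumption.

```python
import math

def neq(probabilities):
    """Exponential of the Shannon entropy of a discrete probability distribution.

    Assumed definition: Neq = exp(-sum(p * ln p)); zero probabilities contribute nothing.
    """
    entropy = -sum(p * math.log(p) for p in probabilities if p > 0.0)
    return math.exp(entropy)

# Two extreme cases over the 16 PBs.
uniform = [1.0 / 16] * 16       # no information: every PB equally likely
certain = [1.0] + [0.0] * 15    # a single PB predicted with certainty

neq_uniform = neq(uniform)
neq_certain = neq(certain)
```

Under this definition the index ranges from 1 (one PB predicted with certainty) to 16 (all PBs equally likely), so lower values indicate more confident predictions.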


Subject(s)
Peptide Fragments/chemistry , Protein Structure, Secondary , Proteins/chemistry , Amino Acid Sequence , Bayes Theorem , Databases, Protein , Probability , Protein Conformation , Reproducibility of Results
6.
Am J Phys Anthropol ; 125(2): 175-92, 2004 Oct.
Article in English | MEDLINE | ID: mdl-15365983

ABSTRACT

This study investigates the GM genetic relationships of 82 human populations, among which 10 represent original data, within and among the main broad geographic areas of the world. Different approaches are used: multidimensional scaling analysis and test for isolation by distance, to assess the correlation between genetic variation and spatial distributions; analysis of variance, to investigate the genetic structure at different hierarchical levels of population subdivision; genetic similarity map (geographic map distorted by available genetic information), to identify regions of high and low genetic variation; and minimal spanning network, to point out possible migration routes across continental areas. The results show that the GM polymorphism is characterized by one of the highest amounts of genetic variation observed so far among populations of different continents (Fct=0.3915, P < 0.0001). GM diversity can be explained by a model of isolation by distance (IBD) at most continental levels, with a particularly significant fit to IBD for the Middle East and Europe. Five peripheral regions of the world (Europe, west and south sub-Saharan Africa, Southeast Asia, and America) exhibit a low level of genetic diversity both within and among populations. By contrast, East and North African, Southwest Asian, and Northeast Asian populations are highly diverse and interconnected genetically by large genetic distances. Therefore, the observed GM variation can be explained by a "centrifugal model" of modern humans peopling history, involving ancient dispersals across a large intercontinental area spanning from East Africa to Northeast Asia, followed by recent migrations in peripheral geographic regions.


Subject(s)
Emigration and Immigration , Genetic Variation , Genetics, Population , Haplotypes/genetics , Models, Biological , Demography , Ethnicity/genetics , Geography , Humans , Population Dynamics
7.
BMC Bioinformatics ; 5: 114, 2004 Aug 23.
Article in English | MEDLINE | ID: mdl-15324460

ABSTRACT

BACKGROUND: Microarray technologies produce large amounts of data. Hierarchical clustering is commonly used to identify clusters of co-expressed genes. However, microarray datasets often contain missing values (MVs), which are a major drawback for the use of clustering methods. Usually the MVs are left untreated, replaced by zero, or estimated by the k-Nearest Neighbor (kNN) approach. This paper studies the stability of gene clusters, defined by various hierarchical clustering algorithms, in microarray experiments with or without MVs. RESULTS: In this study, we show that MVs have important effects on the stability of the gene clusters. Moreover, the magnitude of the gene misallocations depends on the aggregation algorithm. The most appropriate aggregation methods (e.g. complete-linkage and Ward) are highly sensitive to MVs, surprisingly even for a very small proportion of MVs (e.g. 1%). In most cases, the MVs must be replaced by expected values. Replacing MVs with the kNN approach clearly improves the identification of co-expressed gene clusters. Nevertheless, we observe that the kNN approach is less suitable for extreme values of gene expression. CONCLUSION: The presence of MVs (even at a low rate) is a major factor of gene cluster instability. In addition, the impact depends on the hierarchical clustering algorithm used. Some methods should be used carefully. Nevertheless, the kNN approach is an efficient method for restoring missing gene expression values, with a low error level. Our study highlights the need for statistical treatment of microarray data to avoid misinterpretation.
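
The kNN replacement of missing values can be illustrated with a simplified sketch. It assumes a Euclidean-type distance computed over the conditions observed in both genes and an unweighted mean of the k nearest neighbours; the abstract does not specify which kNN variant was used, and the expression matrix below is toy data.

```python
import math

def knn_impute(matrix, k):
    """Return a copy of matrix (rows = genes) with None entries replaced by a kNN mean."""
    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            neighbours = []
            for other_index, other in enumerate(matrix):
                # A neighbour must be a different gene with an observed value in column j.
                if other_index == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                # Root-mean-square difference over the conditions observed in both genes.
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                neighbours.append((dist, other[j]))
            neighbours.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in neighbours[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled

# Toy expression matrix: the first gene has one missing value.
toy = [
    [1.0, 2.0, None],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 3.2],
    [5.0, 6.0, 9.0],
]
imputed = knn_impute(toy, k=2)
```

Here the missing value is estimated from the two genes with the most similar profiles, and the distant fourth gene is ignored.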


Subject(s)
Gene Expression Profiling/statistics & numerical data , Gene Expression Regulation, Fungal/genetics , Genes, Fungal/genetics , Oligonucleotide Array Sequence Analysis/statistics & numerical data , Cluster Analysis , Computational Biology/statistics & numerical data , Databases, Genetic , Gene Expression Profiling/methods , Oligonucleotide Array Sequence Analysis/methods , Reference Values , Research Design/statistics & numerical data , Saccharomyces cerevisiae/genetics
8.
In Silico Biol ; 4(3): 381-6, 2004.
Article in English | MEDLINE | ID: mdl-15724288

ABSTRACT

A statistical analysis of the PDB structures has led us to define a new set of small 3D structural prototypes called Protein Blocks (PBs). This structural alphabet includes 16 PBs, each defined by the (phi, psi) dihedral angles of 5 consecutive residues. The amino acid distributions observed in sequence windows encompassing these PBs are used to predict, by a Bayesian approach, the local 3D structure of proteins from their sequences alone. LocPred is a software tool that allows users to submit a protein sequence and performs a prediction in terms of PBs. The prediction results are given both textually and graphically.


Subject(s)
Proteins/chemistry , Amino Acid Sequence , Models, Molecular , Molecular Sequence Data , Protein Conformation
9.
Protein Sci ; 11(12): 2871-86, 2002 Dec.
Article in English | MEDLINE | ID: mdl-12441385

ABSTRACT

Protein Blocks (PBs) comprise a structural alphabet of 16 protein fragments, each 5 Calpha long. They make it possible to approximate and correctly predict local protein three-dimensional (3D) structures. We have selected the 72 most frequent sequences of five PBs, which we call Structural Words (SWs). Analysis of four different protein data banks shows that SWs cover 92% of the amino acids in them and provide a good structural approximation for residues (i.e., sequences) 9 Calpha long. We present most of them in a simple network that describes 90% of the overall residues and, interestingly, includes more than 80% of the amino acids present in coils. Analysis of the network shows the specificity and quality of the 3D descriptions as well as a new type of relation between local folds and amino acid distribution. The results show that the 3D structure of these protein data banks can be easily described by a combination of subgraphs included in the network. Finally, a Bayesian probabilistic approach improved the prediction rate by 4%.
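
Selecting the most frequent sequences of five PBs amounts to counting overlapping 5-letter words in PB-encoded structures. The sketch below assumes each structure is given as a string over the PB letters a to p; the two input strings are invented toy data.

```python
from collections import Counter

def most_frequent_words(pb_sequences, word_length=5, top_n=72):
    """Count overlapping words of word_length PBs and return the top_n most frequent."""
    counts = Counter()
    for seq in pb_sequences:
        for start in range(len(seq) - word_length + 1):
            counts[seq[start:start + word_length]] += 1
    return counts.most_common(top_n)

# Toy PB-encoded structures (hypothetical, for illustration only).
toy_sequences = ["mmmmmmnopac", "ddfklmmmmmm"]
top_words = most_frequent_words(toy_sequences, top_n=3)
```

On a real databank, `top_n=72` would yield a candidate set of the same size as the Structural Words described above.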


Subject(s)
Computational Biology/methods , Models, Molecular , Peptide Fragments/chemistry , Amino Acid Sequence , Bayes Theorem , Computer Simulation , Databases, Protein , Methionine-tRNA Ligase/chemistry , Molecular Sequence Data , Protein Conformation , Thermodynamics , Thermus thermophilus/enzymology
10.
Comput Chem ; 26(5): 437-45, 2002 Jul.
Article in English | MEDLINE | ID: mdl-12144174

ABSTRACT

The aim of this paper is to present a new approach, called 'Hybrid Chromosome Model' (HXM), which allows both the extraction of regions of similarity between two sequences and the compartmentalization of a set of DNA sequences. The principle of the method consists in compacting a set of sequences (split into fragments of fixed length) into a 'hybrid chromosome', which results from the stacking of the whole sequence fragments. We have illustrated our approach on the 32 subtelomeres of Saccharomyces cerevisiae. The compartmentalization of these chromosome extremities into common regions of similarity has been carried out. The HXM approach is a fast and efficient tool for mapping entire genomes and for extracting ancient duplications within or between genomes.


Subject(s)
Chromosomes, Fungal/genetics , Genome, Fungal , Physical Chromosome Mapping/methods , Saccharomyces cerevisiae/genetics , Telomere/genetics , Base Sequence , Databases, Nucleic Acid , Sequence Alignment/methods
11.
Bioinformatics ; 18(3): 446-51, 2002 Mar.
Article in English | MEDLINE | ID: mdl-11934744

ABSTRACT

MOTIVATION: Locating the regions of similarity in a genome requires the availability of appropriate tools such as 'Accelerated Search for SImilar Regions in Chromosomes' (ASSIRC; Vincens et al., Bioinformatics, 14, 715-725, 1998). The aim of this paper is to present different strategies for improving this program by distributing the operations and data to multiple processing units and to assess the efficiency of the different implementations in terms of running time as a function of the number of processing units. RESULTS: The new version D-ASSIRC is based on three alternative strategies of task sharing: (1) a distributed search using the splitting of studied sequences into large overlapping subsequences (strategy ASS); (2) two distributed searches for repeated exact motifs of fixed size either managed by a central processor (strategy AGD) or locally managed by numerous processors (strategy ALD). The result is that the strategy ASS is suitable for a large number of processing units (the time was divided by a factor of 12 when the number of processing units was increased from 1 to 16) whereas the strategy ALD is better for a small set of processors (typically for four or six). The different proposed strategies are efficient for various applications in genomic research, particularly for locating similarities of nucleic sequences in large genomes. AVAILABILITY: D-ASSIRC is freely available by anonymous FTP at ftp://ftp.ens.fr/pub/molbio/dassirc.tar.gz. Sources and binaries for Solaris and Linux are included in the distribution.
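
The idea behind strategy ASS, splitting a sequence into overlapping subsequences that can be searched independently, can be sketched as below. The chunk size, overlap and function names are hypothetical and are not taken from D-ASSIRC. The second function expresses the reported speed-up as a parallel efficiency: a 12-fold speed-up on 16 units corresponds to 12/16 = 0.75.

```python
def split_with_overlap(sequence, chunk_size, overlap):
    """Split sequence into (start, subsequence) chunks sharing `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append((start, sequence[start:start + chunk_size]))
        if start + chunk_size >= len(sequence):
            break  # the last chunk reaches the end of the sequence
        start += step
    return chunks

def parallel_efficiency(speedup, units):
    """Fraction of ideal linear speed-up achieved on `units` processing units."""
    return speedup / units

# Toy DNA sequence; each chunk could be handed to a separate processing unit.
toy_chunks = split_with_overlap("ACGTACGTAC", chunk_size=4, overlap=2)
efficiency = parallel_efficiency(12, 16)
```

The overlap ensures that a motif straddling a chunk boundary is still fully contained in at least one chunk, provided the motif is no longer than the overlap plus one.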


Subject(s)
Algorithms , DNA/genetics , Genome , Sequence Alignment/methods , Sequence Homology , Software , Base Sequence , Computing Methodologies , Databases, Nucleic Acid , Molecular Sequence Data , National Library of Medicine (U.S.) , Quality Control , Saccharomyces cerevisiae/genetics , Sensitivity and Specificity , Time Factors , United States