Results 1 - 6 of 6
1.
PLoS Comput Biol ; 19(10): e1011521, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37883593

ABSTRACT

Predicting the effects of mutations on protein function is an important issue in evolutionary biology and biomedical applications. Computational approaches, ranging from graphical models to deep-learning architectures, can capture the statistical properties of sequence data and predict the outcome of high-throughput mutagenesis experiments probing the fitness landscape around some wild-type protein. However, how the complexity of the models and the characteristics of the data combine to determine the predictive performance remains unclear. Here, based on a theoretical analysis of the prediction error, we propose descriptors of the sequence data, characterizing their quantity and relevance relative to the model. Our theoretical framework identifies a trade-off between these two quantities, and determines the optimal subset of data for the prediction task, showing that simple models can outperform complex ones when inferred from adequately-selected sequences. We also show how repeated subsampling of the sequence data is informative about how much epistasis in the fitness landscape is not captured by the computational model. Our approach is illustrated on several protein families, as well as on in silico solvable protein models.


Subject(s)
Biological Evolution , Proteins , Proteins/genetics , Mutagenesis , Mutation , Computer Simulation , Genetic Fitness/genetics , Models, Genetic
2.
Phys Rev E ; 101(1-1): 012309, 2020 Jan.
Article in English | MEDLINE | ID: mdl-32069678

ABSTRACT

We consider the problem of inferring a graphical Potts model on a population of variables. This inverse Potts problem generally involves the inference of a large number of parameters, often larger than the number of available data, and, hence, requires the introduction of regularization. We study here a double regularization scheme, in which the number of Potts states (colors) available to each variable is reduced and interaction networks are made sparse. To achieve the color compression, only Potts states with large empirical frequency (exceeding some threshold) are explicitly modeled on each site, while the others are grouped into a single state. We benchmark the performance of this mixed regularization approach, with two inference algorithms, adaptive cluster expansion (ACE) and pseudolikelihood maximization (PLM), on synthetic data obtained by sampling disordered Potts models on Erdős-Rényi random graphs. We show in particular that color compression does not affect the quality of reconstruction of the parameters corresponding to high-frequency symbols, while drastically reducing the number of the other parameters and thus the computational time. Our procedure is also applied to multiple sequence alignments of protein families, with similar results.
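
The color-compression step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `compress_colors`, the integer encoding of the alignment, and the toy data and threshold are all assumptions made for illustration. At each site, states whose empirical frequency exceeds the threshold keep their own label, and all remaining states are merged into one extra "other" label.

```python
import numpy as np

def compress_colors(msa, threshold):
    """Group low-frequency states at each site into a single 'other' state.

    msa: integer array, shape (n_sequences, n_sites), one Potts state per entry.
    threshold: a state is modeled explicitly at a site only if its empirical
               frequency at that site is strictly greater than this value.
    Returns the recoded alignment and, per site, the states kept explicitly.
    (Hypothetical helper for illustration only.)
    """
    msa = np.asarray(msa)
    n_seq, n_sites = msa.shape
    compressed = np.empty_like(msa)
    kept_states = []
    for i in range(n_sites):
        states, counts = np.unique(msa[:, i], return_counts=True)
        freqs = counts / n_seq
        kept = [int(s) for s, f in zip(states, freqs) if f > threshold]
        # Explicit states are relabeled 0..len(kept)-1; the rest share one label.
        mapping = {s: k for k, s in enumerate(kept)}
        other = len(kept)
        compressed[:, i] = [mapping.get(int(s), other) for s in msa[:, i]]
        kept_states.append(kept)
    return compressed, kept_states

# Toy alignment: 5 sequences, 2 sites (made-up data).
toy = np.array([[0, 1], [0, 2], [0, 1], [1, 1], [2, 3]])
compressed, kept = compress_colors(toy, threshold=0.3)
```

With this toy input, each site keeps one explicit state, so the number of colors per site drops from three to two. The number of pairwise coupling parameters between the two sites shrinks accordingly.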

3.
Sci Rep ; 9(1): 18032, 2019 Dec 02.
Article in English | MEDLINE | ID: mdl-31792239

ABSTRACT

We introduce a simple model that describes the average occurrence of point variations in a generic protein sequence. This model is based on the idea that mutations are more likely to be fixed at sites in contact with others that have mutated in the recent past. Therefore, we extend the usual assumptions made in protein coevolution by introducing a time damping on the effect of a substitution on its surroundings, which makes correlated substitutions happen in avalanches localized in space and time. The model correctly predicts the average correlation of substitutions as a function of their distance along the sequence. At the same time, it predicts an among-site distribution of the number of substitutions per site that is highly compatible with a negative binomial, consistent with experimental data. The promising outcomes achieved with this model encourage the application of the same ideas in the field of pairwise and multiple sequence alignment.


Subject(s)
Amino Acid Sequence/genetics , Evolution, Molecular , Models, Genetic , Amino Acid Substitution , Codon/genetics , Humans , Point Mutation , Sequence Alignment
4.
J Phys Chem Lett ; 10(7): 1489-1497, 2019 Apr 04.
Article in English | MEDLINE | ID: mdl-30855965

ABSTRACT

Life machinery, although overwhelmingly complex, is rooted in a rather limited number of molecular processes. One of the most important is protein-protein interaction. Metabolic regulation, protein folding control, and cellular motility are examples of processes based on the fine-tuned interaction of several protein partners. The region on the protein surface devoted to the recognition of a specific partner is essential for the function of the protein and is, therefore, likely to be conserved during evolution. On the other hand, the physical chemistry of amino acids underlies the mechanism of interactions. Both evolutionary and energetic constraints can then be used to build scoring functions capable of recognizing interaction sites. Our working hypothesis is that residues within the interaction interface tend at the same time to be evolutionarily conserved (to preserve their function) and to provide little contribution to the internal stabilization of the structure of their cognate protein, to facilitate conformational adaptation to the partner. Here, we show that for some classes of protein partners (for example, those involved in signal transduction and in enzymes) evolutionary constraints play the key role in defining the interaction surface. In contrast, energetic constraints emerge as more important in protein partners involved in immune response, in inhibitor proteins, and in structural proteins. Our results indicate that a general-purpose scoring function for protein-protein interaction should not be agnostic of the biological function of the partners.


Subject(s)
Proteins/chemistry , Adaptive Immunity , Evolution, Molecular , Models, Molecular , Protein Binding , Protein Conformation , Protein Folding , Proteins/metabolism , Signal Transduction
5.
Genetics ; 207(2): 643-652, 2017 Oct.
Article in English | MEDLINE | ID: mdl-28754661

ABSTRACT

Fast genome sequencing offers invaluable opportunities for building updated and improved models of protein sequence evolution. Here we show that Single Nucleotide Polymorphisms (SNPs) can be used to build a model capable of predicting the probability of substitution between amino acids in variants of the same protein in different species. The model is based on a substitution matrix inferred from the frequency of codon interchanges observed in a suitably selected subset of human SNPs, and predicts the substitution probabilities observed in alignments between Homo sapiens and related species at 85-100% sequence identity better than any other approach we are aware of. The model gradually loses its predictive power at lower sequence identity. Our results suggest that SNPs can be employed, together with multiple sequence alignment data, to model protein sequence evolution. The SNP-based substitution matrix developed in this work can be exploited to better align protein sequences of related organisms, to refine the estimate of the evolutionary distance between protein variants from related species in phylogenetic trees and, in the future, might become a useful tool for population analysis.


Subject(s)
Models, Genetic , Polymorphism, Single Nucleotide , Amino Acid Substitution , Evolution, Molecular , Genome, Human , Humans , Probability , Sequence Alignment
6.
BMC Bioinformatics ; 17: 258, 2016 Jun 24.
Article in English | MEDLINE | ID: mdl-27342318

ABSTRACT

BACKGROUND: Many models of protein sequence evolution, in particular those based on Point Accepted Mutation (PAM) matrices, assume that its dynamics is Markovian. Nevertheless, it has been observed that evolution seems to proceed differently at different time scales, calling this assumption into question. In 2011 Kosiol and Goldman proved that, if evolution is Markovian at the codon level, it cannot be Markovian at the amino acid level. However, it remains unclear to what extent the Markov assumption holds at the codon level. RESULTS: Here we show that the among-site variability of substitution rates also makes the process of full protein sequence evolution effectively non-Markovian, even at the codon level. This may be the theoretical explanation behind the well-known systematic underestimation of evolutionary distances observed when rate variability is omitted. If the substitution rate variability is neglected, the average amino acid and codon replacement probabilities are affected by systematic errors, and those with the largest mismatches are the substitutions involving more than one nucleotide at a time. On the other hand, the instantaneous substitution matrices estimated from alignments under the Markov assumption tend to overestimate double and triple substitutions, even when learned from alignments at high sequence identity. CONCLUSIONS: These results discourage the use of simple Markov models to describe full protein sequence evolution and encourage the use, whenever possible, of models that account for rate variability by construction (such as hidden Markov models or mixture models) or of substitution models of the type of Le and Gascuel (2008) that account for it explicitly.


Subject(s)
Evolution, Molecular , Models, Genetic , Proteins/genetics , Amino Acid Sequence , Codon , Markov Chains , Proteins/chemistry
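
The claim in this abstract, that among-site rate variability makes the site-averaged process effectively non-Markovian, can be checked numerically on a toy case. The sketch below is not the paper's codon-level analysis. It assumes a two-state symmetric substitution model, two equally weighted site rates with mean 1, and an arbitrary time t; all names and values are made up for illustration. A Markov process must satisfy P(2t) = P(t) P(t); the site-averaged transition matrix does not, and a distance estimated from it under a single-rate assumption comes out too small.

```python
import numpy as np

def transition_matrix(rate, t):
    """Closed-form exp(Q * rate * t) for Q = [[-1, 1], [1, -1]]."""
    e = np.exp(-2.0 * rate * t)  # non-trivial eigenvalue of the matrix
    same, diff = 0.5 * (1.0 + e), 0.5 * (1.0 - e)
    return np.array([[same, diff], [diff, same]])

def site_averaged_matrix(rates, weights, t):
    """Average transition matrix over sites evolving at different rates."""
    return sum(w * transition_matrix(r, t) for r, w in zip(rates, weights))

# Assumed toy setting: half the sites are slow, half are fast, mean rate 1.
rates, weights, t = [0.1, 1.9], [0.5, 0.5], 0.5

p_t = site_averaged_matrix(rates, weights, t)
p_2t = site_averaged_matrix(rates, weights, 2 * t)

# Chapman-Kolmogorov check: zero for a Markov process, positive here.
ck_gap = np.abs(p_2t - p_t @ p_t).max()

# Control: a single rate satisfies the Markov property up to rounding.
single = transition_matrix(1.0, t)
single_gap = np.abs(transition_matrix(1.0, 2 * t) - single @ single).max()

# Distance inferred from the averaged matrix if one (wrongly) assumes a
# single rate of 1: invert e = exp(-2 * t_est).
t_est = -0.5 * np.log(p_t[0, 0] - p_t[0, 1])

print(f"Chapman-Kolmogorov gap with rate variability: {ck_gap:.4f}")
print(f"Chapman-Kolmogorov gap with a single rate:    {single_gap:.2e}")
print(f"true distance {t}, single-rate estimate {t_est:.3f}")
```

In this toy setting the estimated distance falls below the true one, in line with the underestimation the abstract mentions. The size of the effect depends entirely on the assumed rate distribution.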