1.
PLoS Comput Biol ; 20(2): e1011812, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38377054

ABSTRACT

The design of proteins with specific tasks is a major challenge in molecular biology with important diagnostic and therapeutic applications. High-throughput screening methods have been developed to systematically evaluate protein activity, but only a small fraction of possible protein variants can be tested using these techniques. Computational models that explore the sequence space in silico to identify the fittest molecules for a given function are needed to overcome this limitation. In this article, we propose AnnealDCA, a machine-learning framework to learn the protein fitness landscape from sequencing data derived from a broad range of experiments that use selection and sequencing to quantify protein activity. We demonstrate the effectiveness of our method by applying it to antibody Rep-Seq data of immunized mice and screening experiments, assessing the quality of the fitness landscape reconstructions. Our method can be applied to several experimental cases where a population of protein variants undergoes various rounds of selection and sequencing, without relying on the computation of variant enrichment ratios, and thus can be used even in cases of disjoint sequence samples.


Subject(s)
Genetic Fitness , Machine Learning , Animals , Mice , Mutation , Genetic Fitness/genetics
3.
Nat Commun ; 12(1): 5800, 2021 Oct 04.
Article in English | MEDLINE | ID: mdl-34608136

ABSTRACT

Generative models emerge as promising candidates for novel sequence-data driven approaches to protein design, and for the extraction of structural and functional information about proteins deeply hidden in rapidly growing sequence databases. Here we propose simple autoregressive models as highly accurate but computationally efficient generative sequence models. We show that they perform similarly to existing approaches based on Boltzmann machines or deep generative models, but at a substantially lower computational cost (by a factor between 10^2 and 10^3). Furthermore, the simple structure of our models has distinctive mathematical advantages, which translate into an improved applicability in sequence generation and evaluation. Within these models, we can easily estimate both the probability of a given sequence, and, using the model's entropy, the size of the functional sequence space related to a specific protein family. In the example of response regulators, we find a huge number of ca. 10^68 possible sequences, which nevertheless constitute only the astronomically small fraction 10^-80 of all amino-acid sequences of the same length. These findings illustrate the potential and the difficulty in exploring sequence space via generative sequence models.
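
As an illustration of why an autoregressive model makes sequence probabilities and entropy easy to evaluate, the sketch below factorizes P(a_1, ..., a_L) into a product of conditionals P(a_i | a_1, ..., a_{i-1}). The softmax form, the parameter names `h` and `J`, and the toy sizes are assumptions made for this example; the abstract does not specify the parametrization.

```python
import itertools
import math
import random

Q = 3  # toy alphabet size (an assumption; protein models typically use 21 symbols)
L = 4  # toy sequence length

rng = random.Random(0)
# Hypothetical parameters: fields h[i][a] and couplings J[i][j][a][b] for j < i.
h = [[rng.gauss(0, 1) for _ in range(Q)] for _ in range(L)]
J = [[[[rng.gauss(0, 0.5) for _ in range(Q)] for _ in range(Q)]
      for _ in range(i)] for i in range(L)]


def conditional(i, prefix):
    """P(a_i = a | a_1..a_{i-1}) for every symbol a, as a softmax."""
    logits = [h[i][a] + sum(J[i][j][a][prefix[j]] for j in range(i))
              for a in range(Q)]
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(x - top) for x in logits]
    norm = sum(weights)
    return [w / norm for w in weights]


def log_prob(seq):
    """Exact log-probability: a sum of L conditional log-probabilities."""
    return sum(math.log(conditional(i, seq[:i])[seq[i]]) for i in range(L))


def sample():
    """Draw one sequence site by site; no Monte Carlo chain is needed."""
    seq = []
    for i in range(L):
        seq.append(rng.choices(range(Q), weights=conditional(i, seq))[0])
    return seq


def entropy():
    """Exact entropy by enumeration (feasible only at toy sizes)."""
    total = 0.0
    for seq in itertools.product(range(Q), repeat=L):
        lp = log_prob(seq)
        total -= math.exp(lp) * lp
    return total
```

Because each conditional is normalized on its own, the full distribution is normalized without computing a partition function. Under this reading, `math.exp(entropy())` gives an effective number of sequences, which is one common way to turn an entropy into a sequence-space size.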


Subject(s)
Models, Statistical , Proteins/chemistry , Amino Acid Sequence , Computational Biology , Databases, Protein , Epistasis, Genetic , Evolution, Molecular , Machine Learning , Mutation , Proteins/classification , Proteins/genetics , Sequence Alignment
4.
Int J Mol Sci ; 22(20), 2021 Oct 09.
Article in English | MEDLINE | ID: mdl-34681569

ABSTRACT

We present Annealed Mutational approximated Landscape (AMaLa), a new method to infer fitness landscapes from sequencing data of Directed Evolution experiments. Such experiments typically start from a single wild-type sequence, which undergoes Darwinian in vitro evolution via multiple rounds of mutation and selection for a target phenotype. In recent years, Directed Evolution has emerged as a powerful instrument to probe fitness landscapes under controlled experimental conditions and, thanks to high-throughput screening and sequencing, as a relevant testing ground to develop accurate statistical models and inference algorithms. Fitness landscape modeling either uses the enrichment of variant abundances as input, which requires observing the same variants at different rounds, or assumes that the last sequenced round is sampled from an equilibrium distribution. AMaLa aims at effectively leveraging the information encoded in the whole time evolution. To do so, while assuming statistical sampling independence between sequenced rounds, the possible trajectories in sequence space are gauged with a time-dependent statistical weight consisting of two contributions: (i) an energy term accounting for the selection process and (ii) a generalized Jukes-Cantor model for the purely mutational step. This simple scheme accurately describes the Directed Evolution dynamics and infers a fitness landscape that correctly reproduces the measures of the phenotype under selection (e.g., antibiotic drug resistance), notably outperforming widely used inference strategies. In addition, we assess the reliability of AMaLa by showing how the inferred statistical model could be used to predict relevant structural properties of the wild-type sequence.
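
For readers unfamiliar with the Jukes-Cantor model mentioned in (ii), the sketch below shows the textbook q-state version, in which every substitution occurs at the same rate. It is not AMaLa's implementation: the abstract does not give the exact parametrization, and the function names, the `rate` convention (rate towards each of the other q - 1 states), and the independent-sites assumption are all choices made for this example.

```python
import math


def jukes_cantor(q, rate, t):
    """Return (p_same, p_diff) after time t for a q-state Jukes-Cantor process.

    p_same is the probability that a site is found in its initial state,
    p_diff the probability of finding it in one specific other state.
    """
    decay = math.exp(-q * rate * t)
    p_same = 1.0 / q + (q - 1) / q * decay
    p_diff = 1.0 / q - decay / q
    return p_same, p_diff


def mutation_log_prob(seq, wild_type, q, rate, t):
    """Log-probability of reaching seq from wild_type by mutation alone.

    Sites are assumed to mutate independently, so only the number of
    mismatches with the wild type matters.
    """
    p_same, p_diff = jukes_cantor(q, rate, t)
    n_mut = sum(a != b for a, b in zip(seq, wild_type))
    n_same = len(seq) - n_mut
    log_p = 0.0
    if n_same:
        log_p += n_same * math.log(p_same)
    if n_mut:
        # At t = 0 no mutation is possible, so any mismatch has probability 0.
        log_p += n_mut * math.log(p_diff) if p_diff > 0 else -math.inf
    return log_p
```

With q = 4 this reduces to the classical nucleotide model. A sequence with more mismatches relative to the wild type is always less likely under the mutational step alone, which is why a separate selection term is needed to explain enriched variants.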


Subject(s)
Computational Biology/methods , Directed Molecular Evolution/methods , Mutation , Algorithms , Evolution, Molecular , Genetic Fitness , High-Throughput Nucleotide Sequencing , Models, Genetic , Sequence Analysis, DNA
5.
Mol Biol Evol ; 38(1): 318-328, 2021 Jan 04.
Article in English | MEDLINE | ID: mdl-32770229

ABSTRACT

The recent technological advances underlying the screening of large combinatorial libraries in high-throughput mutational scans deepen our understanding of adaptive protein evolution and boost its applications in protein design. Nevertheless, the large number of possible genotypes requires suitable computational methods for data analysis, the prediction of mutational effects, and the generation of optimized sequences. We describe a computational method that, trained on sequencing samples from multiple rounds of a screening experiment, provides a model of the genotype-fitness relationship. We tested the method on five large-scale mutational scans, yielding accurate predictions of the mutational effects on fitness. The inferred fitness landscape is robust to experimental and sampling noise and exhibits high generalization power in terms of broader sequence space exploration and higher fitness variant predictions. We investigate the role of epistasis and show that the inferred model provides structural information about the 3D contacts in the molecular fold.


Subject(s)
Evolution, Molecular , Genetic Fitness , Epistasis, Genetic , Mutation , Unsupervised Machine Learning
6.
PLoS Comput Biol ; 16(5): e1007866, 2020 May.
Article in English | MEDLINE | ID: mdl-32421707

ABSTRACT

The precise diagnosis of complex diseases requires integrating a large amount of information from heterogeneous clinical and biomedical data, whose direct and indirect interdependences are notoriously difficult to assess. To this end, we propose an efficient computational approach to simultaneously compute and assess the significance of multivariate information between any combination of mixed-type (continuous/categorical) variables. The method is then used to uncover direct, indirect and possibly causal relationships between mixed-type data from medical records, by extending a recent machine learning method to reconstruct graphical models beyond simple categorical datasets. The method is shown to outperform existing tools on benchmark mixed-type datasets, before being applied to analyze the medical records of elderly patients with cognitive disorders from La Pitié-Salpêtrière Hospital, Paris. The resulting clinical network visually captures the global interdependences in these medical records and some facets of clinical diagnosis practice, without any specific hypothesis or prior knowledge of clinically relevant information. In particular, it provides some physiological insights linking the consequence of cerebrovascular accidents to the atrophy of important brain structures associated with cognitive impairment.


Subject(s)
Learning , Medical Records , Algorithms , Datasets as Topic , Humans , Machine Learning , Paris
7.
Bioinformatics ; 34(13): 2311-2313, 2018 Jul 01.
Article in English | MEDLINE | ID: mdl-29300827

ABSTRACT

Summary: We present a web server running the MIIC algorithm, a network learning method combining constraint-based and information-theoretic frameworks to reconstruct causal, non-causal or mixed networks from non-perturbative data, without the need for an a priori choice on the class of reconstructed network. Starting from a fully connected network, the algorithm first removes dispensable edges by iteratively subtracting the most significant information contributions from indirect paths between each pair of variables. The remaining edges are then filtered based on their confidence assessment or oriented based on the signature of causality in observational data. The MIIC online server can be used for a broad range of biological data, including possible unobserved (latent) variables, from single-cell gene expression data to protein sequence evolution, and outperforms or matches state-of-the-art methods for either causal or non-causal network reconstruction. Availability and implementation: MIIC online can be freely accessed at https://miic.curie.fr. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Neural Networks, Computer , Algorithms , Computers , Software
8.
Proc Natl Acad Sci U S A ; 114(13): E2662-E2671, 2017 Mar 28.
Article in English | MEDLINE | ID: mdl-28289198

ABSTRACT

Proteins have evolved to perform diverse cellular functions, from serving as reaction catalysts to coordinating cellular propagation and development. Frequently, proteins do not exert their full potential as monomers but rather undergo concerted interactions as either homo-oligomers or with other proteins as hetero-oligomers. The experimental study of such protein complexes and interactions has been arduous. Theoretical structure prediction methods are an attractive alternative. Here, we investigate homo-oligomeric interfaces by tracing residue coevolution via the global statistical direct coupling analysis (DCA). DCA can accurately infer spatial adjacencies between residues. These adjacencies can be included as constraints in structure prediction techniques to predict high-resolution models. By taking advantage of the ongoing exponential growth of sequence databases, we go significantly beyond anecdotal cases of a few protein families and apply DCA to a systematic large-scale study of nearly 2,000 Pfam protein families with sufficient sequence information and structurally resolved homo-oligomeric interfaces. We find that large interfaces are commonly identified by DCA. We further demonstrate that DCA can differentiate between subfamilies with different binding modes within one large Pfam family. Sequence-derived contact information for the subfamilies proves sufficient to assemble accurate structural models of the diverse protein-oligomers. Thus, we provide an approach to investigate oligomerization for arbitrary protein families leading to structural models complementary to often-difficult experimental methods. Combined with ever more abundant sequential data, we anticipate that this study will be instrumental to allow the structural description of many heteroprotein complexes in the future.


Subject(s)
Evolution, Molecular , Proteins/chemistry , Databases, Protein , Models, Molecular , Molecular Biology/methods , Protein Conformation , Protein Interaction Domains and Motifs , Proteins/metabolism
9.
J Chem Phys ; 145(17): 174102, 2016 Nov 07.
Article in English | MEDLINE | ID: mdl-27825220

ABSTRACT

Coevolution of residues in contact imposes strong statistical constraints on the sequence variability between homologous proteins. Direct-Coupling Analysis (DCA), a global statistical inference method, successfully models this variability across homologous protein families to infer structural information about proteins. For each residue pair, DCA infers 21 × 21 matrices describing the coevolutionary coupling for each pair of amino acids (or gaps). To achieve the residue-residue contact prediction, these matrices are mapped onto simple scalar parameters; the full information they contain gets lost. Here, we perform a detailed spectral analysis of the coupling matrices resulting from 70 protein families, to show that they contain quantitative information about the physico-chemical properties of amino-acid interactions. Results for protein families are corroborated by the analysis of synthetic data from lattice-protein models, which emphasizes the critical effect of sampling quality and regularization on the biochemical features of the statistical coupling matrices.
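
To make the "mapped onto simple scalar parameters" step concrete, the sketch below uses a choice that is common in the DCA literature: the Frobenius norm of each coupling matrix followed by the average product correction (APC). The abstract does not say which scalar the authors use, so this choice is an assumption, as are the function names and the dictionary layout. The gauge transformation that usually precedes the norm is omitted for brevity.

```python
import math


def frobenius(matrix):
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def apc_scores(couplings, n_sites):
    """Collapse each q x q coupling matrix to one APC-corrected score.

    couplings: dict mapping every pair (i, j) with i < j to a q x q matrix
    (a list of rows). Returns a dict with the same keys.
    """
    norms = [[0.0] * n_sites for _ in range(n_sites)]
    for (i, j), matrix in couplings.items():
        norms[i][j] = norms[j][i] = frobenius(matrix)
    # Mean score of each site with all other sites, and the global mean.
    site_mean = [sum(norms[i]) / (n_sites - 1) for i in range(n_sites)]
    global_mean = sum(site_mean) / n_sites
    if global_mean == 0.0:
        return {pair: 0.0 for pair in couplings}
    return {(i, j): norms[i][j] - site_mean[i] * site_mean[j] / global_mean
            for (i, j) in couplings}
```

Each 21 × 21 matrix carries 441 numbers, and the score keeps one. That compression is the information loss the abstract refers to, and it is what motivates analysing the full matrices instead.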


Subject(s)
Biophysical Phenomena , Evolution, Molecular , Models, Molecular , Proteins/chemistry , Proteins/metabolism , Entropy , Protein Folding , Solvents/chemistry
10.
PLoS Comput Biol ; 12(4): e1004870, 2016 Apr.
Article in English | MEDLINE | ID: mdl-27074145

ABSTRACT

The immune system has developed a number of distinct complex mechanisms to shape and control the antibody repertoire. One of these mechanisms, the affinity maturation process, works in an evolutionary-like fashion: after binding to a foreign molecule, the antibody-producing B-cells exhibit a high-frequency mutation rate in the genome region that codes for the antibody active site. Eventually, cells that produce antibodies with higher affinity for their cognate antigen are selected and clonally expanded. Here, we propose a new statistical approach based on maximum entropy modeling in which a scoring function related to the binding affinity of antibodies against a specific antigen is inferred from a sample of sequences of the immune repertoire of an individual. We use our inference strategy to infer a statistical model on a data set obtained by sequencing a fairly large portion of the immune repertoire of an HIV-1 infected patient. The Pearson correlation coefficient between our scoring function and the IC50 neutralization titer measured on 30 different antibodies of known sequence is as high as 0.77 (p-value 10^-6), outperforming other sequence- and structure-based models.


Subject(s)
Antibody Affinity/physiology , Antigen-Antibody Reactions/physiology , Models, Immunological , Antibodies, Neutralizing/chemistry , Antibodies, Neutralizing/genetics , Antibodies, Neutralizing/metabolism , Antibody Affinity/genetics , Antigen-Antibody Reactions/genetics , B-Lymphocytes/immunology , Binding Sites, Antibody/genetics , Binding Sites, Antibody/physiology , Cluster Analysis , Computational Biology , Computer Simulation , Entropy , Evolution, Molecular , HIV Antibodies/chemistry , HIV Antibodies/genetics , HIV Antibodies/metabolism , HIV Infections/genetics , HIV Infections/immunology , HIV-1/immunology , Humans , Models, Molecular , Mutation , Normal Distribution , Sequence Alignment
11.
Sci Rep ; 3: 3458, 2013 Dec 10.
Article in English | MEDLINE | ID: mdl-24322327

ABSTRACT

In this work we aim to highlight a close analogy between cooperative behaviors in chemical kinetics and cybernetics; this is realized by using a common language for their description, namely mean-field statistical mechanics. First, we perform a one-to-one mapping between paradigmatic behaviors in chemical kinetics (i.e., non-cooperative, cooperative, ultra-sensitive, anti-cooperative) and in mean-field statistical mechanics (i.e., paramagnetic, high and low temperature ferromagnetic, anti-ferromagnetic). Interestingly, the statistical mechanics approach allows a unified, broad theory for all scenarios and, in particular, Michaelis-Menten, Hill and Adair equations are consistently recovered. This framework is then tested against experimental biological data with an overall excellent agreement. Going one step further, we consistently read the whole mapping from a cybernetic perspective, highlighting deep structural analogies between the above-mentioned kinetics and fundamental bricks in electronics (i.e., operational amplifiers, flashes, flip-flops), so as to build a clear bridge linking biochemical kinetics and cybernetics.
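
For reference, the Michaelis-Menten and Hill equations named in the abstract can be written in a few lines. The sketch below uses their standard textbook forms, not the statistical-mechanics derivation of the paper; the function and parameter names are chosen for this example.

```python
def hill(s, k, n=1.0):
    """Fractional saturation at ligand concentration s.

    k is the half-saturation constant and n the Hill coefficient.
    """
    return s ** n / (k ** n + s ** n)


def michaelis_menten(s, vmax, km):
    """Reaction velocity at substrate concentration s."""
    return vmax * s / (km + s)
```

With `n = 1` the Hill curve has the same hyperbolic shape as Michaelis-Menten (the non-cooperative case). In the usual reading, `n > 1` gives a sigmoidal, cooperative response, and `n < 1` corresponds to negative cooperativity. For example, `hill(4.0, 2.0, 2)` exceeds `hill(4.0, 2.0, 1)`, while below the half-saturation point the ordering is reversed.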


Subject(s)
Algorithms , Models, Theoretical , Humans
12.
PLoS One ; 7(7): e39849, 2012.
Article in English | MEDLINE | ID: mdl-22815715

ABSTRACT

Within a fully microscopic setting, we derive a variational principle for the non-equilibrium steady states of chemical reaction networks, valid for time-scales over which chemical potentials can be taken to be slowly varying: at stationarity the system minimizes a global function of the reaction fluxes with the form of a Hopfield Hamiltonian with Hebbian couplings, which is explicitly seen to correspond to the rate of decay of entropy production over time. Guided by this analogy, we show that reaction networks can be formally re-cast as systems of interacting reactions that optimize the use of the available compounds by competing for substrates, akin to agents competing for a limited resource in an optimal allocation problem. As an illustration, we analyze the scenario that emerges in two simple cases: that of toy (random) reaction networks and that of a metabolic network model of the human red blood cell.


Subject(s)
Models, Chemical , Erythrocytes/metabolism , Humans , Metabolic Networks and Pathways , Stochastic Processes , Thermodynamics