Results 1 - 20 of 48
1.
PLoS Comput Biol ; 19(11): e1011621, 2023 Nov.
Article in English | MEDLINE | ID: mdl-37976326

ABSTRACT

We present here an approach to protein design that combines (i) scarce functional information such as experimental data, (ii) evolutionary information learned from natural sequence variants, and (iii) physics-grounded modeling. Using a Restricted Boltzmann Machine (RBM), we learn a sequence model of a protein family. We use semi-supervision to leverage available functional information during the RBM training. We then propose a strategy to explore the protein representation space that can be informed by external models such as an empirical force-field method (FoldX). Our approach is applied to a domain of the Cas9 protein responsible for recognition of a short DNA motif. We experimentally assess the functionality of 71 variants generated to explore a range of RBM and FoldX energies. Sequences with as many as 50 differences (20% of the protein domain) to the wild-type retained functionality. Overall, 21/71 sequences designed with our method were functional. Interestingly, 6/71 sequences showed an improved activity in comparison with the original wild-type protein sequence. These results demonstrate the interest in further exploring the synergies between machine learning of protein sequence representations and physics-grounded modeling strategies informed by structural information.


Subject(s)
CRISPR-Cas Systems , Proteins , Proteins/genetics , Proteins/chemistry , Amino Acid Sequence , Machine Learning , Learning
2.
PLoS Comput Biol ; 19(10): e1011521, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37883593

ABSTRACT

Predicting the effects of mutations on protein function is an important issue in evolutionary biology and biomedical applications. Computational approaches, ranging from graphical models to deep-learning architectures, can capture the statistical properties of sequence data and predict the outcome of high-throughput mutagenesis experiments probing the fitness landscape around some wild-type protein. However, how the complexity of the models and the characteristics of the data combine to determine the predictive performance remains unclear. Here, based on a theoretical analysis of the prediction error, we propose descriptors of the sequence data, characterizing their quantity and relevance relative to the model. Our theoretical framework identifies a trade-off between these two quantities, and determines the optimal subset of data for the prediction task, showing that simple models can outperform complex ones when inferred from adequately-selected sequences. We also show how repeated subsampling of the sequence data is informative about how much epistasis in the fitness landscape is not captured by the computational model. Our approach is illustrated on several protein families, as well as on in silico solvable protein models.


Subject(s)
Biological Evolution , Proteins , Proteins/genetics , Mutagenesis , Mutation , Computer Simulation , Genetic Fitness/genetics , Models, Genetic
3.
Elife ; 12, 2023 09 08.
Article in English | MEDLINE | ID: mdl-37681658

ABSTRACT

Antigen immunogenicity and the specificity of binding of T-cell receptors to antigens are key properties underlying effective immune responses. Here we propose diffRBM, an approach based on transfer learning and Restricted Boltzmann Machines, to build sequence-based predictive models of these properties. DiffRBM is designed to learn the distinctive patterns in amino-acid composition that, on the one hand, underlie the antigen's probability of triggering a response, and on the other hand the T-cell receptor's ability to bind to a given antigen. We show that the patterns learnt by diffRBM allow us to predict putative contact sites of the antigen-receptor complex. We also discriminate immunogenic and non-immunogenic antigens, antigen-specific and generic receptors, reaching performances that compare favorably to existing sequence-based predictors of antigen immunogenicity and T-cell receptor specificity.


Subject(s)
Amino Acids , Learning , T-Cell Antigen Receptor Specificity , Cell Membrane , Mitochondrial Membranes
4.
Phys Rev E ; 108(2-1): 024141, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37723761

ABSTRACT

We study transition paths in energy landscapes over multicategorical Potts configurations using the mean-field approach introduced by Mauri et al. [Phys. Rev. Lett. 130, 158402 (2023); doi:10.1103/PhysRevLett.130.158402]. Paths interpolate between two fixed configurations or are anchored at one extremity only. We characterize the properties of "good" transition paths realizing a trade-off between exploring low-energy regions in the landscape and being not too long, such as their entropy or the probability of escape from a region of the landscape. We unveil the existence of a phase transition separating a regime in which paths are stretched in between their anchors from another regime where paths can explore the energy landscape more globally to minimize the energy. This phase transition is first illustrated and studied in detail on a mathematically tractable Hopfield-Potts toy model, then studied in energy landscapes inferred from protein sequence data.

5.
Phys Rev Lett ; 130(15): 158402, 2023 Apr 14.
Article in English | MEDLINE | ID: mdl-37115874

ABSTRACT

Identifying and characterizing mutational paths is an important issue in evolutionary biology, with potential applications to bioengineering. We here propose an algorithm to sample mutational paths, which we benchmark on exactly solvable models of proteins in silico, and apply to data-driven models of natural proteins learned from sequence data with restricted Boltzmann machines. We then use mean-field theory to characterize paths for different mutational dynamics of interest, and to extend Kimura's estimate of evolutionary distances to sequence-based epistatic models of selection.


Subject(s)
Biological Evolution , Proteins , Mutation , Proteins/genetics , Algorithms
6.
Curr Opin Struct Biol ; 80: 102571, 2023 06.
Article in English | MEDLINE | ID: mdl-36947951

ABSTRACT

Computational protein design facilitates the discovery of novel proteins with prescribed structure and functionality. Exciting designs were recently reported using novel data-driven methodologies that can be roughly divided into two categories: evolutionary-based and physics-inspired approaches. The former infer characteristic sequence features shared by sets of evolutionary-related proteins, such as conserved or coevolving positions, and recombine them to generate candidates with similar structure and function. The latter approaches estimate key biochemical properties, such as structure free energy, conformational entropy, or binding affinities using machine learning surrogates, and optimize them to yield improved designs. Here, we review recent progress along both tracks, discuss their strengths and weaknesses, and highlight opportunities for synergistic approaches.


Subject(s)
Machine Learning , Proteins , Proteins/chemistry , Physics , Databases, Protein
7.
Elife ; 12, 2023 03 14.
Article in English | MEDLINE | ID: mdl-36916902

ABSTRACT

Establishing accurate as well as interpretable models of network activity is an open challenge in systems neuroscience. Here, we infer an energy-based model of the anterior rhombencephalic turning region (ARTR), a circuit that controls zebrafish swimming statistics, using functional recordings of the spontaneous activity of hundreds of neurons. Although our model is trained to reproduce the low-order statistics of the network activity at short time scales, its simulated dynamics quantitatively captures the slowly alternating activity of the ARTR. It further reproduces the modulation of this persistent dynamics by the water temperature and visual stimulation. Mathematical analysis of the model unveils a low-dimensional landscape-based representation of the ARTR activity, where the slow network dynamics reflects Arrhenius-like barrier crossings between metastable states. Our work thus shows how data-driven models built from large neural population recordings can be reduced to low-dimensional functional models in order to reveal the fundamental mechanisms controlling the collective neuronal dynamics.


Subject(s)
Neural Networks, Computer , Zebrafish , Animals , Zebrafish/physiology , Neurons/physiology , Swimming , Photic Stimulation , Models, Neurological
8.
Nucleic Acids Res ; 50(21): 12082-12093, 2022 11 28.
Article in English | MEDLINE | ID: mdl-36478056

ABSTRACT

The hybridization kinetics of an oligonucleotide to its template is a fundamental step in many biological processes such as replication arrest, CRISPR recognition, DNA sequencing, DNA origami, etc. Although single kinetic descriptions exist for special cases of this problem, there are no simple general prediction schemes. In this work, we have measured experimentally, with no fluorescent labelling, the displacement of an oligonucleotide from its substrate in two situations: one corresponding to oligonucleotide binding/unbinding on ssDNA and one in which the oligonucleotide is displaced by the refolding of a dsDNA fork. In this second situation, the fork expels the oligonucleotide, thus significantly reducing its residence time. To account for our data in these two situations, we have constructed a mathematical model, based on the known nearest neighbour dinucleotide free energies, and provided a good estimate of the residence times of different oligonucleotides (DNA, RNA, LNA) of various lengths in different experimental conditions (force, temperature, buffer conditions, presence of mismatches, etc.). This study provides a foundation for the dynamics of oligonucleotide displacement, a process of importance in numerous biological and bioengineering contexts.


Subject(s)
DNA , Oligonucleotides , DNA/genetics , Nucleic Acid Hybridization , DNA, Single-Stranded , Oligonucleotide Probes
9.
PLoS Comput Biol ; 18(9): e1010561, 2022 09.
Article in English | MEDLINE | ID: mdl-36174101

ABSTRACT

Selection protocols such as SELEX, where molecules are selected over multiple rounds for their ability to bind to a target of interest, are popular methods for obtaining binders for diagnostic and therapeutic purposes. We show that Restricted Boltzmann Machines (RBMs), an unsupervised two-layer neural network architecture, can successfully be trained on sequence ensembles from single rounds of SELEX experiments for thrombin aptamers. RBMs assign scores to sequences that can be directly related to their fitnesses estimated through experimental enrichment ratios. Hence, RBMs trained from sequence data at a given round can be used to predict the effects of selection at later rounds. Moreover, the parameters of the trained RBMs are interpretable and identify functional features contributing most to sequence fitness. To exploit the generative capabilities of RBMs, we introduce two different training protocols: one taking into account sequence counts, capable of identifying the few best binders, and another based on unique sequences only, generating more diverse binders. We then use RBM models to generate novel aptamers with putative disruptive mutations or good binding properties, and validate the generated sequences with gel shift assay experiments. Finally, we compare the RBM's performance with different supervised learning approaches that include random forests and several deep neural network architectures.


Subject(s)
Neural Networks, Computer , Thrombin , Machine Learning
10.
Nat Commun ; 13(1): 4122, 2022 07 15.
Article in English | MEDLINE | ID: mdl-35840595

ABSTRACT

Episodic memory formation and recall are complementary processes that rely on opposing neuronal computations in the hippocampus. How this conflict is resolved in hippocampal circuits is unclear. To address this question, we obtained in vivo whole-cell patch-clamp recordings from dentate gyrus granule cells in head-fixed mice trained to explore and distinguish between familiar and novel virtual environments. We find that granule cells consistently show a small transient depolarisation upon transition to a novel environment. This synaptic novelty signal is sensitive to local application of atropine, indicating that it depends on metabotropic acetylcholine receptors. A computational model suggests that the synaptic response to novelty may bias granule cell population activity, which can drive downstream attractor networks to a new state, favouring the switch from recall to new memory formation when faced with novelty. Such a novelty-driven switch may enable flexible encoding of new memories while preserving stable retrieval of familiar ones.


Subject(s)
Hippocampus , Memory, Episodic , Animals , Dentate Gyrus/physiology , Hippocampus/physiology , Mental Recall/physiology , Mice , Neurons/physiology
11.
Nature ; 606(7913): 389-395, 2022 06.
Article in English | MEDLINE | ID: mdl-35589842

ABSTRACT

Cancer immunoediting is a hallmark of cancer that predicts that lymphocytes kill more immunogenic cancer cells to cause less immunogenic clones to dominate a population. Although proven in mice, whether immunoediting occurs naturally in human cancers remains unclear. Here, to address this, we investigate how 70 human pancreatic cancers evolved over 10 years. We find that, despite having more time to accumulate mutations, rare long-term survivors of pancreatic cancer who have stronger T cell activity in primary tumours develop genetically less heterogeneous recurrent tumours with fewer immunogenic mutations (neoantigens). To quantify whether immunoediting underlies these observations, we infer that a neoantigen is immunogenic (high-quality) by two features: 'non-selfness', based on neoantigen similarity to known antigens, and 'selfness', based on the antigenic distance required for a neoantigen to differentially bind to the MHC or activate a T cell compared with its wild-type peptide. Using these features, we estimate cancer clone fitness as the aggregate cost of T cells recognizing high-quality neoantigens offset by gains from oncogenic mutations. With this model, we predict the clonal evolution of tumours to reveal that long-term survivors of pancreatic cancer develop recurrent tumours with fewer high-quality neoantigens. Thus, we submit evidence that the human immune system naturally edits neoantigens. Furthermore, we present a model to predict how immune pressure induces cancer cell populations to evolve over time. More broadly, our results argue that the immune system fundamentally surveils host genetic changes to suppress cancer.


Subject(s)
Antigens, Neoplasm , Cancer Survivors , Pancreatic Neoplasms , Antigens, Neoplasm/genetics , Antigens, Neoplasm/immunology , Humans , Pancreatic Neoplasms/genetics , Pancreatic Neoplasms/immunology , Pancreatic Neoplasms/pathology , T-Lymphocytes/immunology , Tumor Escape/immunology
12.
RNA ; 28(3): 277-289, 2022 03.
Article in English | MEDLINE | ID: mdl-34937774

ABSTRACT

Coronavirus RNA-dependent RNA polymerases produce subgenomic RNAs (sgRNAs) that encode viral structural and accessory proteins. User-friendly bioinformatic tools to detect and quantify sgRNA production are urgently needed to study the growing number of next-generation sequencing (NGS) data of SARS-CoV-2. We introduced sgDI-tector to identify and quantify sgRNA in SARS-CoV-2 NGS data. sgDI-tector allowed detection of sgRNA without initial knowledge of the transcription-regulatory sequences. We produced NGS data and successfully detected the nested set of sgRNAs with the ranking M > ORF3a > N > ORF6 > ORF7a > ORF8 > S > E > ORF7b. We also compared the level of sgRNA production with other types of viral RNA products such as defective interfering viral genomes.


Subject(s)
Computational Biology/methods , Genome, Viral , RNA, Viral/genetics , SARS-CoV-2/genetics , High-Throughput Nucleotide Sequencing , Open Reading Frames
13.
Phys Rev E ; 104(3-1): 034109, 2021 Sep.
Article in English | MEDLINE | ID: mdl-34654094

ABSTRACT

Restricted Boltzmann machines (RBM) are bilayer neural networks used for the unsupervised learning of model distributions from data. The bipartite architecture of RBM naturally defines an elegant sampling procedure, called alternating Gibbs sampling (AGS), where the configurations of the latent-variable layer are sampled conditional to the data-variable layer and vice versa. We study here the performance of AGS on several analytically tractable models borrowed from statistical mechanics. We show that standard AGS is not more efficient than classical Metropolis-Hastings (MH) sampling of the effective energy landscape defined on the data layer. However, RBM can identify meaningful representations of training data in their latent space. Furthermore, using these representations and combining Gibbs sampling with the MH algorithm in the latent space can enhance the sampling performance of the RBM when the hidden units encode weakly dependent features of the data. We illustrate our findings on three datasets: Bars and Stripes and MNIST, well known in machine learning, and the so-called lattice proteins dataset, introduced in theoretical biology to study the sequence-to-structure mapping in proteins.
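
The alternating Gibbs sampling procedure described in this abstract can be illustrated with a minimal sketch. It assumes a binary (0/1) RBM with energy E(v, h) = -b_vis·v - b_hid·h - vᵀWh; the function and parameter names (`alternating_gibbs`, `W`, `b_vis`, `b_hid`) are illustrative choices, not taken from the paper, and the models studied there may use other unit types.

```python
import numpy as np

def sigmoid(x):
    # Logistic function, used for the conditional activation probabilities.
    return 1.0 / (1.0 + np.exp(-x))

def alternating_gibbs(v0, W, b_vis, b_hid, n_steps, rng):
    """Run n_steps of alternating Gibbs sampling on a binary RBM (assumed form).

    v0    : (n_vis,) initial visible configuration of 0/1 values
    W     : (n_vis, n_hid) coupling matrix
    b_vis : (n_vis,) visible biases
    b_hid : (n_hid,) hidden biases
    """
    v = np.asarray(v0, dtype=np.int64).copy()
    h = np.zeros(W.shape[1], dtype=np.int64)
    for _ in range(n_steps):
        # Sample the latent layer conditional on the data layer.
        p_h = sigmoid(b_hid + v @ W)
        h = (rng.random(p_h.shape) < p_h).astype(np.int64)
        # Sample the data layer conditional on the latent layer.
        p_v = sigmoid(b_vis + W @ h)
        v = (rng.random(p_v.shape) < p_v).astype(np.int64)
    return v, h
```

Because the architecture is bipartite, all units within a layer are conditionally independent given the other layer, which is why each half-step can be drawn in a single vectorized operation.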

14.
PLoS Comput Biol ; 17(9): e1009297, 2021 09.
Article in English | MEDLINE | ID: mdl-34473697

ABSTRACT

With the increasing ability to use high-throughput next-generation sequencing to quantify the diversity of the human T cell receptor (TCR) repertoire, the ability to use TCR sequences to infer antigen-specificity could greatly aid potential diagnostics and therapeutics. Here, we use a machine-learning approach known as Restricted Boltzmann Machine to develop a sequence-based inference approach to identify antigen-specific TCRs. Our approach combines probabilistic models of TCR sequences with clone abundance information to extract TCR sequence motifs central to an antigen-specific response. We use this model to identify patient-personalized TCR motifs that respond to individual tumor and infectious disease antigens, and to accurately discriminate specific from non-specific responses. Furthermore, the hidden structure of the model results in an interpretable representation space where TCRs responding to the same antigen cluster, correctly discriminating the response of TCR to different viral epitopes. The model can be used to identify condition-specific responding TCRs. We focus on the examples of TCRs reactive to candidate neoantigens and selected epitopes in experiments of stimulated TCR clone expansion.


Subject(s)
Computational Biology/methods , Models, Statistical , T-Lymphocytes/immunology , Cancer Survivors , Carcinoma, Pancreatic Ductal/immunology , Cluster Analysis , Datasets as Topic , Humans , Pancreatic Neoplasms/immunology , Receptors, Antigen, T-Cell/immunology
15.
Phys Rev E ; 103(5-1): 052413, 2021 May.
Article in English | MEDLINE | ID: mdl-34134280

ABSTRACT

Affinity maturation (AM) is the process through which the immune system is able to develop potent antibodies against new pathogens it encounters, and is at the basis of the efficacy of vaccines. At its core AM is analogous to a Darwinian evolutionary process, where B cells mutate and are selected on the basis of their affinity for an antigen (Ag), and Ag availability tunes the selective pressure. In cases when this selective pressure is high, the number of B cells might quickly decrease and the population might risk extinction in what is known as a population bottleneck. Here we study the probability for a B-cell lineage to survive this bottleneck scenario as a function of the progenitor affinity for the Ag. Using recursive relations and probability generating functions we derive expressions for the average extinction time and progeny size for lineages that go extinct. We then extend our results to the full population, both in the absence and presence of competition for T-cell help, and quantify the population survival probability as a function of Ag concentration and initial population size. Our study suggests the population bottleneck phenomenology might represent a limit case in the space of biologically plausible maturation scenarios, whose characterization could help guide the process of vaccine development.


Subject(s)
Antibody Affinity , B-Lymphocytes/immunology
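
The role of probability generating functions in the entry above can be illustrated with a generic textbook sketch. It assumes a simple Galton-Watson branching process with Poisson-distributed offspring, which is a stand-in chosen here for illustration and not the affinity-dependent model of the paper. In that setting the extinction probability of a lineage is the smallest fixed point q = G(q) of the offspring generating function G, reached by iterating from q = 0. The names `extinction_probability` and `poisson_pgf` are hypothetical.

```python
import math

def extinction_probability(pgf, tol=1e-12, max_iter=100_000):
    """Smallest fixed point of q = pgf(q) on [0, 1], by iteration from q = 0.

    pgf is the probability generating function of the offspring number
    (an assumed, user-supplied callable).
    """
    q = 0.0
    for _ in range(max_iter):
        q_next = pgf(q)
        if abs(q_next - q) < tol:
            return q_next
        q = q_next
    return q

def poisson_pgf(mean):
    # Generating function of a Poisson offspring distribution: G(s) = exp(mean * (s - 1)).
    return lambda s: math.exp(mean * (s - 1.0))

# Survival probability of a lineage founded by one cell, for an illustrative mean of 2.
survival = 1.0 - extinction_probability(poisson_pgf(2.0))
```

When the mean offspring number is below one, the iteration converges to q = 1, meaning the lineage goes extinct with certainty.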
16.
Bioinformatics ; 37(22): 4083-4090, 2021 11 18.
Article in English | MEDLINE | ID: mdl-34117879

ABSTRACT

MOTIVATION: Modeling of protein family sequence distribution from homologous sequence data recently received considerable attention, in particular for structure and function predictions, as well as for protein design. In particular, direct coupling analysis, a method to infer effective pairwise interactions between residues, was shown to capture important structural constraints and to successfully generate functional protein sequences. Building on this and other graphical models, we introduce a new framework to assess the quality of the secondary structures of the generated sequences with respect to reference structures for the family. RESULTS: We introduce two scoring functions characterizing the likeliness of the secondary structure of a protein sequence to match a reference structure, called Dot Product and Pattern Matching. We test these scores on published experimental protein mutagenesis and design datasets, and show improvement in the detection of nonfunctional sequences. We also show that use of these scores helps reject nonfunctional sequences generated by graphical models (Restricted Boltzmann Machines) learned from homologous sequence alignments. AVAILABILITY AND IMPLEMENTATION: Data and code available at https://github.com/CyrilMa/ssqa. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Proteins , Proteins/chemistry , Amino Acid Sequence , Sequence Alignment , Protein Structure, Secondary , Mutagenesis
17.
PLoS Comput Biol ; 17(3): e1008751, 2021 03.
Article in English | MEDLINE | ID: mdl-33765014

ABSTRACT

The sequences of antibodies from a given repertoire are highly diverse at a few sites located on the surface of a genome-encoded larger scaffold. The scaffold is often considered to play a lesser role than highly diverse, non-genome-encoded sites in controlling binding affinity and specificity. To gauge the impact of the scaffold, we carried out quantitative phage display experiments where we compare the response to selection for binding to four different targets of three different antibody libraries based on distinct scaffolds but harboring the same diversity at randomized sites. We first show that the response to selection of an antibody library may be captured by two measurable parameters. Second, we provide evidence that one of these parameters is determined by the degree of affinity maturation of the scaffold, affinity maturation being the process by which antibodies accumulate somatic mutations to evolve towards higher affinities during the natural immune response. In all cases, we find that libraries of antibodies built around maturated scaffolds have a lower response to selection to other arbitrary targets than libraries built around germline-based scaffolds. We thus propose that germline-encoded scaffolds have a higher selective potential than maturated ones as a consequence of a selection for this potential over the long-term evolution of germline antibody genes. Our results are a first step towards quantifying the evolutionary potential of biomolecules.


Subject(s)
Antibodies/genetics , Gene Library , Computational Biology , DNA/genetics , Evolution, Molecular , Humans
18.
Mol Biol Evol ; 38(6): 2428-2445, 2021 05 19.
Article in English | MEDLINE | ID: mdl-33555346

ABSTRACT

COVID-19 can lead to acute respiratory syndrome, which can be due to dysregulated immune signaling. We analyze the distribution of CpG dinucleotides, a pathogen-associated molecular pattern, in the SARS-CoV-2 genome. We characterize CpG content by a CpG force that accounts for statistical constraints acting on the genome at the nucleotide and amino acid levels. The CpG force, as the CpG content, is overall low compared with other pathogenic betacoronaviruses; however, it widely fluctuates along the genome, with a particularly low value, comparable with the circulating seasonal HKU1, in the spike coding region and a greater value, comparable with SARS and MERS, in the highly expressed nucleocapsid coding region (N ORF), whose transcripts are relatively abundant in the cytoplasm of infected cells and present in the 3'UTRs of all subgenomic RNA. This dual nature of CpG content could confer to SARS-CoV-2 the ability to avoid triggering pattern recognition receptors upon entry, while eliciting a stronger response during replication. We then investigate the evolution of synonymous mutations since the outbreak of the COVID-19 pandemic, finding a signature of CpG loss in regions with a greater CpG force. Sequence motifs preceding the CpG-loss-associated loci in the N ORF match recently identified binding patterns of the zinc finger antiviral protein. Using a model of the viral gene evolution under human host pressure, we find that synonymous mutations seem driven in the SARS-CoV-2 genome, and particularly in the N ORF, by the viral codon bias, the transition-transversion bias, and the pressure to lower CpG content.


Subject(s)
COVID-19/genetics , CpG Islands , Evolution, Molecular , Genome, Viral , RNA, Viral/genetics , SARS-CoV-2 , Humans , SARS-CoV-2/genetics , SARS-CoV-2/pathogenicity
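
As a rough illustration of what "CpG content" measures in the entry above, the sketch below computes the classical observed/expected CpG ratio of a sequence. This is a deliberately simpler quantity than the CpG force defined in the paper, which additionally accounts for nucleotide-level and amino-acid-level constraints; the function name `cpg_observed_expected` is a hypothetical choice.

```python
def cpg_observed_expected(seq):
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G).

    A simple stand-in for CpG content, not the CpG force of the paper.
    Returns 0.0 when the sequence contains no C or no G.
    """
    s = seq.upper()
    n_c = s.count("C")
    n_g = s.count("G")
    if n_c == 0 or n_g == 0:
        return 0.0
    # "CG" cannot overlap with itself, so str.count gives the exact dinucleotide count.
    n_cg = s.count("CG")
    return n_cg * len(s) / (n_c * n_g)
```

Applying such a function over sliding windows would give a coarse picture of how CpG content varies along a genome, under the caveat that the ratio ignores the coding constraints the paper's measure is designed to handle.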
19.
Cell Syst ; 12(2): 195-202.e9, 2021 02 17.
Article in English | MEDLINE | ID: mdl-33338400

ABSTRACT

The recent increase of immunopeptidomics data, obtained by mass spectrometry or binding assays, opens up possibilities for investigating endogenous antigen presentation by the highly polymorphic human leukocyte antigen class I (HLA-I) protein. State-of-the-art methods predict with high accuracy presentation by HLA alleles that are well represented in databases at the time of release but have a poorer performance for rarer and less characterized alleles. Here, we introduce a method based on Restricted Boltzmann Machines (RBMs) for prediction of antigens presented on the Major Histocompatibility Complex (MHC) encoded by HLA genes: RBM-MHC. RBM-MHC can be trained on custom and newly available samples with no or a small amount of HLA annotations. RBM-MHC ensures improved predictions for rare alleles and matches state-of-the-art performance for well-characterized alleles while being less data demanding. RBM-MHC is shown to be a flexible and easily interpretable method that can be used as a predictor of cancer neoantigens and viral epitopes, as a tool for feature discovery, and to reconstruct peptide motifs presented on specific HLA molecules.


Subject(s)
Antigen Presentation/immunology , Computational Biology/methods , Histocompatibility Antigens Class I/genetics , Histocompatibility Antigens Class I/immunology , Algorithms , Alleles , Antigen Presentation/genetics , Databases, Protein , Epitopes , HLA Antigens/genetics , HLA Antigens/immunology , Humans , Machine Learning , Major Histocompatibility Complex/immunology , Mass Spectrometry/methods , Models, Theoretical , Peptides/chemistry , Protein Binding
20.
Science ; 369(6502): 440-445, 2020 07 24.
Article in English | MEDLINE | ID: mdl-32703877

ABSTRACT

The rational design of enzymes is an important goal for both fundamental and practical reasons. Here, we describe a process to learn the constraints for specifying proteins purely from evolutionary sequence data, design and build libraries of synthetic genes, and test them for activity in vivo using a quantitative complementation assay. For chorismate mutase, a key enzyme in the biosynthesis of aromatic amino acids, we demonstrate the design of natural-like catalytic function with substantial sequence diversity. Further optimization focuses the generative model toward function in a specific genomic context. The data show that sequence-based statistical models suffice to specify proteins and provide access to an enormous space of functional sequences. This result provides a foundation for a general process for evolution-based design of artificial proteins.


Subject(s)
Chorismate Mutase , Evolution, Molecular , Models, Genetic , Models, Statistical , Amino Acid Sequence , Chorismate Mutase/chemistry , Chorismate Mutase/genetics , Escherichia coli Proteins/chemistry , Escherichia coli Proteins/genetics