1.
Trends Pharmacol Sci ; 45(3): 255-267, 2024 03.
Article in English | MEDLINE | ID: mdl-38378385

ABSTRACT

Generative biology combines artificial intelligence (AI), advanced life sciences technologies, and automation to revolutionize the process of designing novel biomolecules with prescribed properties, giving drug discoverers the ability to escape the limitations of biology during the design of next-generation protein therapeutics. Significant hurdles remain, namely: (i) the inherently complex nature of drug discovery, (ii) the bewildering number of promising computational and experimental techniques that have emerged in the past several years, and (iii) the limited availability of relevant protein sequence-function data for drug-like molecules. There is a need to focus on computational methods that will be most practically effective for protein drug discovery and on building experimental platforms to generate the data most appropriate for these methods. Here, we discuss recent advances in computational and experimental life sciences that are most crucial for impacting the pace and success of protein drug discovery.


Subjects
Artificial Intelligence; Drug Discovery; Humans; Drug Discovery/methods; Biology
2.
MAbs ; 15(1): 2256745, 2023.
Article in English | MEDLINE | ID: mdl-37698932

ABSTRACT

Biologic drug discovery pipelines are designed to deliver protein therapeutics that have exquisite functional potency and selectivity while also manifesting biophysical characteristics suitable for manufacturing, storage, and convenient administration to patients. The ability to use computational methods to predict biophysical properties from protein sequence, potentially in combination with high throughput assays, could decrease timelines and increase the success rates for therapeutic developability engineering by eliminating lengthy and expensive cycles of recombinant protein production and testing. To support development of high-quality predictive models for antibody developability, we designed a sequence-diverse panel of 83 effector functionless IgG1 antibodies displaying a range of biophysical properties, produced and formulated each protein under standard platform conditions, and collected a comprehensive package of analytical data, including in vitro assays and in vivo mouse pharmacokinetics. We used this robust training data set to build machine learning classifier models that can predict complex protein behavior from these data and features derived from predicted and/or experimental structures. Our models predict with 87% accuracy whether viscosity at 150 mg/mL is above or below a threshold of 15 centipoise (cP) and with 75% accuracy whether the area under the plasma drug concentration-time curve (AUC0-672 h) in normal mouse is above or below a threshold of 3.9 × 10^6 h × ng/mL.


Subjects
Antibodies, Monoclonal; Drug Discovery; Animals; Mice; Antibodies, Monoclonal/chemistry; Computer Simulation; Recombinant Proteins; Viscosity
3.
Proteins ; 91(11): 1471-1486, 2023 Nov.
Article in English | MEDLINE | ID: mdl-37337902

ABSTRACT

Protein engineers aim to discover and design novel sequences with targeted, desirable properties. Given the near-limitless size of the protein sequence landscape, it is no surprise that these desirable sequences are often a relative rarity. This makes identifying such sequences a costly and time-consuming endeavor. In this work, we show how to use a deep transformer protein language model to identify sequences that have the most promise. Specifically, we use the model's self-attention map to calculate a Promise Score that weights the relative importance of a given sequence according to predicted interactions with a specified binding partner. This Promise Score can then be used to identify strong binders worthy of further study and experimentation. We use the Promise Score within two protein engineering contexts: Nanobody (Nb) discovery and protein optimization. With Nb discovery, we show how the Promise Score provides an effective way to select lead sequences from Nb repertoires. With protein optimization, we show how to use the Promise Score to select site-specific mutagenesis experiments that identify a high percentage of improved sequences. In both cases, we also show how the self-attention map used to calculate the Promise Score can indicate which regions of a protein are involved in intermolecular interactions that drive the targeted property. Finally, we describe how to fine-tune the transformer protein language model to learn a predictive model for the targeted property, and discuss the capabilities and limitations of fine-tuning with and without knowledge transfer within the context of protein engineering.


Subjects
Language; Protein Engineering; Mutagenesis, Site-Directed; Amino Acid Sequence; Research Design
4.
Expert Rev Hematol ; 16(sup1): 107-127, 2023 03.
Article in English | MEDLINE | ID: mdl-36920855

ABSTRACT

BACKGROUND: The National Hemophilia Foundation (NHF) conducted extensive, inclusive community consultations to guide prioritization of research in coming decades in alignment with its mission to find cures and address and prevent complications enabling people and families with blood disorders to thrive. RESEARCH DESIGN AND METHODS: With the American Thrombosis and Hemostasis Network, NHF recruited multidisciplinary expert working groups (WG) to distill the community-identified priorities into concrete research questions and score their feasibility, impact, and risk. WG6 was charged with identifying the infrastructure, workforce development, and funding and resources to facilitate the prioritized research. Community input on conclusions was gathered at the NHF State of the Science Research Summit. RESULTS: WG6 detailed a minimal research capacity infrastructure threshold, and opportunities to enable its attainment, for bleeding disorders centers to participate in prospective, multicenter national registries. They identified challenges and opportunities to recruit, retain, and train the diverse multidisciplinary care and research workforce required into the future. Innovative collaborative approaches to trial design, resource networking, and funding to surmount obstacles facing research in rare disorders were elucidated. CONCLUSIONS: The innovations in infrastructure, workforce development, and resources and funding proposed herein may contribute to facilitating a National Research Blueprint for Inherited Bleeding Disorders.


Research is critical to advancing the diagnosis and care of people with inherited bleeding disorders (PWIBD). This research requires significant infrastructure, including people and resources. Hemophilia treatment centers (HTC) need many different skilled care professionals including doctors, nurses, and other providers; also statisticians, data managers, and other experts to process patients' clinical information into research. Attracting diverse qualified professionals to the clinical and research work requires long-term planning, recruiting individuals in training programs and retaining them as they become experts. Research infrastructure includes physical servers running database software, networks that link them, and the environment in which these components function. US Centers for Disease Control and Prevention (CDC) and American Thrombosis and Hemostasis Network (ATHN) coordinate and fund data collection at HTCs on the health and well-being of thousands of PWIBD into a registry used in research studies. National Hemophilia Foundation (NHF) and ATHN asked our group of health care professionals, technology experts, and lived experience experts (LEE) to identify the infrastructure, workforce, and resources needed to do the research most important to PWIBD. We identified the types of CDC/ATHN studies all HTCs should be able to perform, and the physical and human infrastructure this requires. We prioritized finding the best clinical trial designs to study inherited bleeding disorders, identifying ways to share personnel and tools between HTCs, and innovating how research is governed and funded. Involving LEEs in designing, managing, and carrying out research will be key in conducting research to improve the lives of PWIBD.


Subjects
Hemophilia A; Thrombosis; Humans; United States; Prospective Studies; Hemostasis; Workforce
5.
PLOS Digit Health ; 1(2): e0000012, 2022 Feb.
Article in English | MEDLINE | ID: mdl-36812511

ABSTRACT

Sepsis is a potentially life-threatening inflammatory response to infection or severe tissue damage. It has a highly variable clinical course, requiring constant monitoring of the patient's state to guide the management of intravenous fluids and vasopressors, among other interventions. Despite decades of research, there is still debate among experts on optimal treatment. Here, we combine, for the first time, distributional deep reinforcement learning with mechanistic physiological models to find personalized sepsis treatment strategies. Our method handles partial observability by leveraging known cardiovascular physiology, introducing a novel physiology-driven recurrent autoencoder, and quantifies the uncertainty of its own results. Moreover, we introduce a framework for uncertainty-aware decision support with humans in the loop. We show that our method learns physiologically explainable, robust policies that are consistent with clinical knowledge. Further, our method consistently identifies high-risk states that lead to death, which could potentially benefit from more frequent vasopressor administration, providing valuable guidance for future research.

6.
Algorithms Mol Biol ; 16(1): 13, 2021 Jul 01.
Article in English | MEDLINE | ID: mdl-34210336

ABSTRACT

BACKGROUND: Directed evolution (DE) is a technique for protein engineering that involves iterative rounds of mutagenesis and screening to search for sequences that optimize a given property, such as binding affinity to a specified target. Unfortunately, the underlying optimization problem is under-determined, and so mutations introduced to improve the specified property may come at the expense of unmeasured, but nevertheless important properties (e.g., solubility and thermostability). We address this issue by formulating DE as a regularized Bayesian optimization problem where the regularization term reflects evolutionary or structure-based constraints. RESULTS: We applied our approach to three representative proteins, GB1, BRCA1, and SARS-CoV-2 Spike, and evaluated both evolutionary and structure-based regularization terms. The results of these experiments demonstrate that: (i) structure-based regularization usually leads to better designs (and never hurts), compared to the unregularized setting; (ii) evolutionary-based regularization tends to be least effective; and (iii) regularization leads to better designs because it effectively focuses the search in certain areas of sequence space, making better use of the experimental budget. Additionally, like previous work in machine learning-assisted DE, we find that our approach significantly reduces the experimental burden of DE, relative to model-free methods. CONCLUSION: Introducing regularization into a Bayesian ML-assisted DE framework alters the exploratory patterns of the underlying optimization routine, and can shift variant selections towards those with a range of targeted and desirable properties. In particular, we find that structure-based regularization often improves variant selection compared to unregularized approaches, and never hurts.

7.
Bioinformatics ; 37(Suppl_1): i451-i459, 2021 07 12.
Article in English | MEDLINE | ID: mdl-34252975

ABSTRACT

MOTIVATION: The recent emergence of cloud laboratories, collections of automated wet-lab instruments that are accessed remotely, presents new opportunities to apply Artificial Intelligence and Machine Learning in scientific research. Among these is the challenge of automating the process of optimizing experimental protocols to maximize data quality. RESULTS: We introduce a new deterministic algorithm, called PaRallel OptimizaTiOn for ClOud Laboratories (PROTOCOL), that improves experimental protocols via asynchronous, parallel Bayesian optimization. The algorithm achieves exponential convergence with respect to simple regret. We demonstrate PROTOCOL in both simulated and real-world cloud labs. In the simulated lab, it outperforms alternative approaches to Bayesian optimization in terms of its ability to find optimal configurations, and the number of experiments required to find the optimum. In the real-world lab, the algorithm makes progress toward the optimal setting. DATA AVAILABILITY AND IMPLEMENTATION: PROTOCOL is available as both a stand-alone Python library and as part of an R Shiny application at https://github.com/clangmead/PROTOCOL. Data are available at the same repository. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Artificial Intelligence; Software; Algorithms; Bayes Theorem; Laboratories
8.
J Comput Biol ; 22(6): 474-86, 2015 Jun.
Article in English | MEDLINE | ID: mdl-25973864

ABSTRACT

In studying the strength and specificity of interaction between members of two protein families, key questions center on which pairs of possible partners actually interact, how well they interact, and why they interact while others do not. The advent of large-scale experimental studies of interactions between members of a target family and a diverse set of possible interaction partners offers the opportunity to address these questions. We develop here a method, DgSpi (data-driven graphical models of specificity in protein:protein interactions), for learning and using graphical models that explicitly represent the amino acid basis for interaction specificity (why) and extend earlier classification-oriented approaches (which) to predict the ΔG of binding (how well). We demonstrate the effectiveness of our approach in analyzing and predicting interactions between a set of 82 PDZ recognition modules against a panel of 217 possible peptide partners, based on data from MacBeath and colleagues. Our predicted ΔG values are highly predictive of the experimentally measured ones, reaching correlation coefficients of 0.69 in 10-fold cross-validation and 0.63 in leave-one-PDZ-out cross-validation. Furthermore, the model serves as a compact representation of amino acid constraints underlying the interactions, enabling protein-level ΔG predictions to be naturally understood in terms of residue-level constraints. Finally, the model DgSpi readily enables the design of new interacting partners, and we demonstrate that designed ligands are novel and diverse.


Subjects
Protein Binding/genetics; Proteins/genetics; Amino Acid Sequence; Amino Acids/genetics; Ligands; Models, Molecular; Sensitivity and Specificity
9.
Res Comput Mol Biol ; 8394: 129-143, 2014.
Article in English | MEDLINE | ID: mdl-25414914

ABSTRACT

In studying the strength and specificity of interaction between members of two protein families, key questions center on which pairs of possible partners actually interact, how well they interact, and why they interact while others do not. The advent of large-scale experimental studies of interactions between members of a target family and a diverse set of possible interaction partners offers the opportunity to address these questions. We develop here a method, DgSpi (Data-driven Graphical models of Specificity in Protein:protein Interactions), for learning and using graphical models that explicitly represent the amino acid basis for interaction specificity (why) and extend earlier classification-oriented approaches (which) to predict the ΔG of binding (how well). We demonstrate the effectiveness of our approach in analyzing and predicting interactions between a set of 82 PDZ recognition modules, against a panel of 217 possible peptide partners, based on data from MacBeath and colleagues. Our predicted ΔG values are highly predictive of the experimentally measured ones, reaching correlation coefficients of 0.69 in 10-fold cross-validation and 0.63 in leave-one-PDZ-out cross-validation. Furthermore, the model serves as a compact representation of amino acid constraints underlying the interactions, enabling protein-level ΔG predictions to be naturally understood in terms of residue-level constraints. Finally, as a generative model, DgSpi readily enables the design of new interacting partners, and we demonstrate that designed ligands are novel and diverse.

10.
Adv Exp Med Biol ; 805: 87-105, 2014.
Article in English | MEDLINE | ID: mdl-24446358

ABSTRACT

Atomistic simulations of the conformational dynamics of proteins can be performed using either Molecular Dynamics or Monte Carlo procedures. The ensembles of three-dimensional structures produced during simulation can be analyzed in a number of ways to elucidate the thermodynamic and kinetic properties of the system. The goal of this chapter is to review both traditional and emerging methods for learning generative models from atomistic simulation data. Here, the term 'generative' refers to a model of the joint probability distribution over the behaviors of the constituent atoms. In the context of molecular modeling, generative models reveal the correlation structure between the atoms, and may be used to predict how the system will respond to structural perturbations. We begin by discussing traditional methods, which produce multivariate Gaussian models. We then discuss GAMELAN (GRAPHICAL MODELS OF ENERGY LANDSCAPES), which produces generative models of complex, non-Gaussian conformational dynamics (e.g., allostery, binding, folding, etc.) from long timescale simulation data.
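
As a rough illustration of the "traditional" multivariate Gaussian models mentioned above, the sketch below fits a mean vector and covariance matrix to simulation frames and reads off a correlation between two coordinates. The frame layout (one tuple of coordinates per frame) and the function names are assumptions made for this example; it is not GAMELAN and does not reproduce any method from the chapter.

```python
import math

def fit_gaussian(frames):
    """Fit a multivariate Gaussian (mean, covariance) to simulation frames.

    `frames` is a hypothetical list of equal-length coordinate tuples,
    one per simulation snapshot. Uses the unbiased (n - 1) estimator.
    """
    n = len(frames)
    if n < 2:
        raise ValueError("need at least two frames")
    d = len(frames[0])
    mean = [sum(f[j] for f in frames) / n for j in range(d)]
    cov = [
        [
            sum((f[i] - mean[i]) * (f[j] - mean[j]) for f in frames) / (n - 1)
            for j in range(d)
        ]
        for i in range(d)
    ]
    return mean, cov

def correlation(cov, i, j):
    """Pearson correlation between coordinates i and j from a covariance matrix."""
    return cov[i][j] / math.sqrt(cov[i][i] * cov[j][j])

# Toy frames (made-up numbers): the second coordinate is twice the first.
toy_frames = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)]
toy_mean, toy_cov = fit_gaussian(toy_frames)
print(toy_mean, toy_cov, correlation(toy_cov, 0, 1))
```

The covariance matrix is what exposes the correlation structure between coordinates; a Gaussian model captures only these pairwise, linear relationships, which is the limitation that motivates non-Gaussian models.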


Subjects
Models, Statistical; Molecular Dynamics Simulation; Monte Carlo Method; Allosteric Regulation; Antibodies, Monoclonal/chemistry; CD4 Antigens/chemistry; HIV Envelope Protein gp120/chemistry; HIV Fusion Inhibitors/chemistry; HIV-1/chemistry; Homeodomain Proteins/chemistry; Humans; Normal Distribution; Protein Binding; Protein Conformation; Protein Folding
11.
BMC Biophys ; 5: 13, 2012 Jun 29.
Article in English | MEDLINE | ID: mdl-22748306

ABSTRACT

BACKGROUND: G protein coupled receptors (GPCRs) are seven helical transmembrane proteins that function as signal transducers. They bind ligands in their extracellular and transmembrane regions and activate cognate G proteins at their intracellular surface at the other side of the membrane. The relay of allosteric communication between the ligand binding site and the distant G protein binding site is poorly understood. In this study, GREMLIN [1], a recently developed method that identifies networks of co-evolving residues from multiple sequence alignments, was used to identify those that may be involved in communicating the activation signal across the membrane. The GREMLIN-predicted long-range interactions between amino acids were analyzed with respect to the seven GPCR structures that had been crystallized at the time this study was undertaken. RESULTS: GREMLIN significantly enriches the edges containing residues that are part of the ligand binding pocket, when compared to a control distribution of edges drawn from a random graph. An analysis of these edges reveals a minimal GPCR binding pocket containing four residues (T118^3.33, M207^5.42, Y268^6.51 and A292^7.39). Additionally, of the ten residues predicted to have the most long-range interactions (A117^3.32, A272^6.55, E113^3.28, H211^5.46, S186^EC2, A292^7.39, E122^3.37, G90^2.57, G114^3.29 and M207^5.42), nine are part of the ligand binding pocket. CONCLUSIONS: We demonstrate the use of GREMLIN to reveal a network of statistically correlated and functionally important residues in class A GPCRs. GREMLIN identified that ligand binding pocket residues are extensively correlated with distal residues. An analysis of the GREMLIN edges across multiple structures suggests that there may be a minimal binding pocket common to the seven known GPCRs. Further, the activation of rhodopsin involves these long-range interactions between extracellular and intracellular domain residues mediated by the retinal domain.

12.
BMC Bioinformatics ; 13 Suppl 5: S8, 2012 Apr 12.
Article in English | MEDLINE | ID: mdl-22537012

ABSTRACT

Stochastic Differential Equations (SDE) are often used to model the stochastic dynamics of biological systems. Unfortunately, rare but biologically interesting behaviors (e.g., oncogenesis) can be difficult to observe in stochastic models. Consequently, the analysis of behaviors of SDE models using numerical simulations can be challenging. We introduce a method for solving the following problem: given an SDE model and a high-level behavioral specification about the dynamics of the model, algorithmically decide whether the model satisfies the specification. While there are a number of techniques for addressing this problem for discrete-state stochastic models, the analysis of SDE and other continuous-state models has received less attention. Our proposed solution uses a combination of Bayesian sequential hypothesis testing, non-identically distributed samples, and Girsanov's theorem for change of measures to examine rare behaviors. We use our algorithm to analyze two SDE models of tumor dynamics. Our use of non-identically distributed samples contributes to the state of the art in statistical verification and model checking of stochastic models by providing an effective means for exposing rare events in SDEs, while retaining the ability to compute bounds on the probability that those events occur.
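
To make the Bayesian sequential hypothesis testing ingredient concrete, here is a minimal sketch for the simplest setting only: independent, identically distributed pass/fail outcomes and a uniform prior on the unknown satisfaction probability p. It tests H0: p >= theta against H1: p < theta and stops once the Bayes factor crosses a threshold. The function names, the uniform prior, and the threshold value are assumptions for illustration; the non-identically distributed sampling and Girsanov change of measure described in the abstract are not shown.

```python
from math import comb, inf

def posterior_prob_below(theta, successes, n):
    """P(p < theta | data) under a uniform prior on p.

    The posterior is Beta(successes + 1, n - successes + 1); for integer
    parameters its CDF equals a binomial tail sum, so no special functions
    are needed.
    """
    m = n + 1
    return sum(
        comb(m, j) * theta**j * (1 - theta) ** (m - j)
        for j in range(successes + 1, m + 1)
    )

def bayes_factor(theta, successes, n):
    """Bayes factor of H0 (p >= theta) against H1 (p < theta)."""
    p_h1 = posterior_prob_below(theta, successes, n)
    if p_h1 == 0.0:
        return inf
    posterior_odds = (1.0 - p_h1) / p_h1
    prior_odds = (1.0 - theta) / theta  # P(H0) / P(H1) under the uniform prior
    return posterior_odds / prior_odds

def sequential_test(sample_trace, theta, threshold=100.0, max_samples=10_000):
    """Draw pass/fail outcomes until the evidence favors H0 or H1.

    `sample_trace` is a hypothetical zero-argument callable that simulates
    the model once and returns True if the trace satisfies the specification.
    """
    successes = 0
    for n in range(1, max_samples + 1):
        if sample_trace():
            successes += 1
        bf = bayes_factor(theta, successes, n)
        if bf >= threshold:
            return "H0", n
        if bf <= 1.0 / threshold:
            return "H1", n
    return "undecided", max_samples

# A trace generator that always satisfies the specification.
print(sequential_test(lambda: True, theta=0.5))
```

The number of samples is not fixed in advance: the test stops as soon as the evidence is strong enough in either direction, which is what makes the sequential formulation attractive when each simulation is expensive.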


Subjects
Algorithms; Cell Transformation, Neoplastic; Models, Biological; Stochastic Processes; Bayes Theorem; Humans; Probability
13.
Proteins ; 79(4): 1061-78, 2011 Apr.
Article in English | MEDLINE | ID: mdl-21268112

ABSTRACT

We introduce a new approach to learning statistical models from multiple sequence alignments (MSA) of proteins. Our method, called GREMLIN (Generative REgularized ModeLs of proteINs), learns an undirected probabilistic graphical model of the amino acid composition within the MSA. The resulting model encodes both the position-specific conservation statistics and the correlated mutation statistics between sequential and long-range pairs of residues. Existing techniques for learning graphical models from MSA either make strong, and often inappropriate assumptions about the conditional independencies within the MSA (e.g., Hidden Markov Models), or else use suboptimal algorithms to learn the parameters of the model. In contrast, GREMLIN makes no a priori assumptions about the conditional independencies within the MSA. We formulate and solve a convex optimization problem, thus guaranteeing that we find a globally optimal model at convergence. The resulting model is also generative, allowing for the design of new protein sequences that have the same statistical properties as those in the MSA. We perform a detailed analysis of covariation statistics on the extensively studied WW and PDZ domains and show that our method out-performs an existing algorithm for learning undirected probabilistic graphical models from MSA. We then apply our approach to 71 additional families from the PFAM database and demonstrate that the resulting models significantly out-perform Hidden Markov Models in terms of predictive accuracy.
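
The two kinds of statistics the abstract refers to, position-specific conservation and correlated mutations between pairs of positions, can be illustrated on a toy alignment. The sketch below computes per-column residue frequencies and the mutual information between two columns. Mutual information is a much simpler stand-in chosen for this example; it is not GREMLIN's regularized graphical model learning, and the toy alignment and function names are made up.

```python
import math
from collections import Counter

def column(msa, i):
    """Return column i of an alignment given as equal-length strings."""
    return [seq[i] for seq in msa]

def column_frequencies(msa, i):
    """Position-specific residue frequencies for column i."""
    counts = Counter(column(msa, i))
    n = len(msa)
    return {residue: c / n for residue, c in counts.items()}

def mutual_information(msa, i, j):
    """Mutual information (in nats) between columns i and j."""
    n = len(msa)
    counts_i = Counter(column(msa, i))
    counts_j = Counter(column(msa, j))
    counts_ij = Counter(zip(column(msa, i), column(msa, j)))
    mi = 0.0
    for (a, b), c in counts_ij.items():
        p_ab = c / n
        mi += p_ab * math.log(p_ab / ((counts_i[a] / n) * (counts_j[b] / n)))
    return mi

# Toy alignment: columns 0 and 1 co-vary perfectly; column 2 is independent of column 0.
toy_msa = ["ACA", "ACG", "GTA", "GTG"]
print(column_frequencies(toy_msa, 0))
print(mutual_information(toy_msa, 0, 1), mutual_information(toy_msa, 0, 2))
```

Pairwise statistics like this treat each pair of columns in isolation, so they cannot separate direct from indirect couplings; learning a joint graphical model over all positions, as the abstract describes, is one way to address that.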


Subjects
Models, Chemical; Protein Folding; Proteins/chemistry; Sequence Alignment/methods; Amino Acid Sequence; Area Under Curve; Computational Biology; Computer Graphics; Computer Simulation; Markov Chains; Models, Molecular; Models, Statistical; PDZ Domains; Sequence Analysis, Protein; Structure-Activity Relationship
14.
Proteins ; 79(2): 444-62, 2011 Feb.
Article in English | MEDLINE | ID: mdl-21120864

ABSTRACT

Protein-protein interactions are governed by the change in free energy upon binding, ΔG = ΔH - TΔS. These interactions are often marginally stable, so one must examine the balance between the change in enthalpy, ΔH, and the change in entropy, ΔS, when investigating known complexes, characterizing the effects of mutations, or designing optimized variants. To perform a large-scale study into the contribution of conformational entropy to binding free energy, we developed a technique called GOBLIN (Graphical mOdel for BiomoLecular INteractions) that performs physics-based free energy calculations for protein-protein complexes under both side-chain and backbone flexibility. GOBLIN uses a probabilistic graphical model that exploits conditional independencies in the Boltzmann distribution and employs variational inference techniques that approximate the free energy of binding in only a few minutes. We examined the role of conformational entropy on a benchmark set of more than 700 mutants in eight large, well-studied complexes. Our findings suggest that conformational entropy is important in protein-protein interactions: the root mean square error (RMSE) between calculated and experimentally measured ΔΔGs decreases by 12% when explicit entropic contributions are incorporated. GOBLIN models all atoms of the protein complex and detects changes to the binding entropy along the interface as well as positions distal to the binding interface. Our results also suggest that a variational approach to entropy calculations may be quantitatively more accurate than the knowledge-based approaches used by the well-known programs FOLDX and Rosetta: GOBLIN's RMSEs are 10% and 36% lower than those of these programs, respectively.
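
The relation ΔG = ΔH - TΔS and the RMSE metric used to compare calculated and measured ΔΔG values are easy to work through numerically. The sketch below does so with made-up numbers; the units (kcal/mol for ΔH, kcal/(mol·K) for ΔS), the temperature, and the function names are assumptions for illustration and are not taken from the paper or from GOBLIN.

```python
import math

def binding_free_energy(delta_h, delta_s, temperature=298.15):
    """Return ΔG = ΔH - TΔS.

    Assumes ΔH in kcal/mol, ΔS in kcal/(mol·K), and temperature in kelvin.
    """
    return delta_h - temperature * delta_s

def rmse(calculated, measured):
    """Root mean square error between two equal-length lists of values."""
    if len(calculated) != len(measured):
        raise ValueError("lists must have the same length")
    squared = [(c - m) ** 2 for c, m in zip(calculated, measured)]
    return math.sqrt(sum(squared) / len(squared))

# Illustrative values only: a favorable enthalpy partly offset by an entropy loss.
dg = binding_free_energy(delta_h=-20.0, delta_s=-0.04, temperature=300.0)
print(dg)  # approximately -8 kcal/mol

# Made-up calculated vs. measured ΔΔG values for two hypothetical mutants.
print(rmse([1.0, 2.0], [1.0, 4.0]))
```

In this toy case the entropic term (-TΔS = +12 kcal/mol) cancels more than half of the enthalpic gain, which is the sense in which marginally stable complexes depend on the balance between the two terms.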


Subjects
Proteins/chemistry; Algorithms; Amino Acids/chemistry; Animals; Computer Simulation; Entropy; Humans; Markov Chains; Models, Molecular; Molecular Dynamics Simulation; Mutation; Protein Binding; Protein Interaction Domains and Motifs; Protein Structure, Quaternary; Proteins/genetics; Software
15.
J Bioinform Comput Biol ; 7(2): 323-38, 2009 Apr.
Article in English | MEDLINE | ID: mdl-19340918

ABSTRACT

We present an exact algorithm, based on techniques from the field of Model Checking, for finding control policies for Boolean Networks (BN) with control nodes. Given a BN, a set of starting states, I, a set of goal states, F, and a target time, t, our algorithm automatically finds a sequence of control signals that deterministically drives the BN from I to F at, or before time t, or else guarantees that no such policy exists. Despite recent hardness-results for finding control policies for BNs, we show that, in practice, our algorithm runs in seconds to minutes on over 13,400 BNs of varying sizes and topologies, including a BN model of embryogenesis in Drosophila melanogaster with 15,360 Boolean variables. We then extend our method to automatically identify a set of Boolean transfer functions that reproduce the qualitative behavior of gene regulatory networks. Specifically, we automatically learn a BN model of D. melanogaster embryogenesis in 5.3 seconds, from a space containing 6.9 × 10^10 possible models.
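
The control problem stated above can be made concrete with a small explicit-state search. The sketch below uses breadth-first search over sets of network states to find a shortest sequence of control vectors that moves every starting state into the goal set within a time bound, or reports that none exists. This is a naive enumeration swapped in for illustration, not the model-checking algorithm of the paper, and it would not scale to networks with thousands of variables; the toy network, the function names, and the simplification that all starting states must be in F at the same time step are assumptions.

```python
from collections import deque
from itertools import product

def find_control_policy(step, n_controls, initial_states, goal_states, max_time):
    """Search for control vectors driving all initial states into the goal set.

    `step(state, control)` is a hypothetical deterministic update function
    mapping a state tuple and a control tuple to the next state tuple.
    Returns a shortest list of control tuples, or None if no policy of
    length <= max_time exists.
    """
    goal = frozenset(goal_states)
    start = frozenset(initial_states)
    controls = list(product((False, True), repeat=n_controls))
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        states, policy = queue.popleft()
        if states <= goal:
            return policy
        if len(policy) == max_time:
            continue  # time budget exhausted on this branch
        for u in controls:
            nxt = frozenset(step(s, u) for s in states)
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, policy + [u]))
    return None

def toy_step(state, control):
    """Made-up two-node network with one control input."""
    x0, x1 = state
    (u,) = control
    return (x1 or u, x0 and not u)

policy = find_control_policy(
    toy_step, 1, {(False, False)}, {(False, True)}, max_time=3
)
print(policy)
```

Because the search is exhaustive up to the time bound, a `None` result is a guarantee that no control sequence of that length exists for the toy network, mirroring the "or else guarantees that no such policy exists" property in the abstract.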


Subjects
Drosophila Proteins/metabolism; Drosophila melanogaster/embryology; Drosophila melanogaster/physiology; Embryonic Development/physiology; Models, Biological; Protein Interaction Mapping/methods; Signal Transduction/physiology; Animals; Feedback/physiology
16.
J Comput Biol ; 11(2-3): 277-98, 2004.
Article in English | MEDLINE | ID: mdl-15285893

ABSTRACT

High-throughput NMR structural biology can play an important role in structural genomics. We report an automated procedure for high-throughput NMR resonance assignment for a protein of known structure, or of a homologous structure. These assignments are a prerequisite for probing protein-protein interactions, protein-ligand binding, and dynamics by NMR. Assignments are also the starting point for structure determination and refinement. A new algorithm, called Nuclear Vector Replacement (NVR) is introduced to compute assignments that optimally correlate experimentally measured NH residual dipolar couplings (RDCs) to a given a priori whole-protein 3D structural model. The algorithm requires only uniform 15N-labeling of the protein and processes unassigned HN-15N HSQC spectra, HN-15N RDCs, and sparse HN-HN NOEs (dNNs), all of which can be acquired in a fraction of the time needed to record the traditional suite of experiments used to perform resonance assignments. NVR runs in minutes and efficiently assigns the (HN, 15N) backbone resonances as well as the dNNs of the 3D 15N-NOESY spectrum, in O(n^3) time. The algorithm is demonstrated on NMR data from a 76-residue protein, human ubiquitin, matched to four structures, including one mutant (homolog), determined either by x-ray crystallography or by different NMR experiments (without RDCs). NVR achieves an assignment accuracy of 92-100%. We further demonstrate the feasibility of our algorithm for different and larger proteins, using NMR data for hen lysozyme (129 residues, 97-100% accuracy) and streptococcal protein G (56 residues, 100% accuracy), matched to a variety of 3D structural models. Finally, we extend NVR to a second application, 3D structural homology detection, and demonstrate that NVR is able to identify structural homologies between proteins with remote amino acid sequences using a database of structural models.


Subjects
Algorithms; Computational Biology; Magnetic Resonance Spectroscopy/statistics & numerical data; Protein Structure, Tertiary
17.
J Biomol NMR ; 29(2): 111-38, 2004 Jun.
Article in English | MEDLINE | ID: mdl-15014227

ABSTRACT

We report an automated procedure for high-throughput NMR resonance assignment for a protein of known structure, or of a homologous structure. Our algorithm performs Nuclear Vector Replacement (NVR) by Expectation/Maximization (EM) to compute assignments. NVR correlates experimentally-measured NH residual dipolar couplings (RDCs) and chemical shifts to a given a priori whole-protein 3D structural model. The algorithm requires only uniform 15N-labelling of the protein, and processes unassigned HN-15N HSQC spectra, HN-15N RDCs, and sparse HN-HN NOEs (dNNs). NVR runs in minutes and efficiently assigns the (HN, 15N) backbone resonances as well as the sparse dNNs from the 3D 15N-NOESY spectrum, in O(n^3) time. The algorithm is demonstrated on NMR data from a 76-residue protein, human ubiquitin, matched to four structures, including one mutant (homolog), determined either by X-ray crystallography or by different NMR experiments (without RDCs). NVR achieves an average assignment accuracy of over 99%. We further demonstrate the feasibility of our algorithm for different and larger proteins, using different combinations of real and simulated NMR data for hen lysozyme (129 residues) and streptococcal protein G (56 residues), matched to a variety of 3D structural models.


Subjects
Algorithms; Computer Simulation; Magnetic Resonance Spectroscopy/methods; Animals; Carbon Isotopes/chemistry; Crystallography, X-Ray; Humans; Molecular Structure; Muramidase/chemistry; Nitrogen Isotopes/chemistry; Ubiquitin/chemistry
18.
Article in English | MEDLINE | ID: mdl-16448021

ABSTRACT

One goal of the structural genomics initiative is the identification of new protein folds. Sequence-based structural homology prediction methods are an important means for prioritizing unknown proteins for structure determination. However, an important challenge remains: two highly dissimilar sequences can have similar folds; how can we detect this rapidly in the context of structural genomics? High-throughput NMR experiments, coupled with novel algorithms for data analysis, can address this challenge. We report an automated procedure, called HD, for detecting 3D structural homologies from sparse, unassigned protein NMR data. Our method identifies 3D models in a protein structural database whose geometries best fit the unassigned experimental NMR data. HD does not use, and is thus not limited by, sequence homology. The method can also be used to confirm or refute structural predictions made by other techniques such as protein threading or homology modelling. The algorithm runs in O(pn + pn^(5/2) log(cn) + p log p) time, where p is the number of proteins in the database, n is the number of residues in the target protein and c is the maximum edge weight in an integer-weighted bipartite graph. Our experiments on real NMR data from 3 different proteins against a database of 4,500 representative folds demonstrate that the method identifies closely related protein folds, including sub-domains of larger proteins, with as little as 10-30% sequence homology between the target protein (or sub-domain) and the computed model. In particular, we report no false-negatives or false-positives despite significant percentages of missing experimental data.


Subjects
Crystallography, X-Ray/methods , Models, Chemical , Models, Molecular , Peptide Mapping/methods , Proteins/chemistry , Proteins/ultrastructure , Sequence Analysis, Protein/methods , Algorithms , Amino Acid Sequence , Artificial Intelligence , Computer Simulation , Imaging, Three-Dimensional/methods , Molecular Sequence Data , Protein Conformation , Proteins/analysis , Sequence Homology, Amino Acid
19.
J Comput Biol ; 10(3-4): 521-36, 2003.
Article in English | MEDLINE | ID: mdl-12935342

ABSTRACT

We introduce a model-based analysis technique for extracting and characterizing rhythmic expression profiles from genome-wide DNA microarray hybridization data. These patterns are clues to discovering rhythmic genes implicated in cell-cycle, circadian, or other biological processes. The algorithm, implemented in a program called RAGE (Rhythmic Analysis of Gene Expression), decouples the problems of estimating a pattern's wavelength and phase. Our algorithm is linear-time in frequency and phase resolution, an improvement over previous quadratic-time approaches. Unlike previous approaches, RAGE uses a true distance metric for measuring expression profile similarity, based on the Hausdorff distance. This results in better clustering of expression profiles for rhythmic analysis. The confidence of each frequency estimate is computed using Z-scores. We demonstrate that RAGE is superior to other techniques on synthetic and actual DNA microarray hybridization data. We also show how to replace the discretized phase search in our method with an exact (combinatorially precise) phase search, resulting in a faster algorithm with no complexity dependence on phase resolution.
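
The Hausdorff distance mentioned above is a true metric on finite point sets. The sketch below shows the generic definition, the larger of the two directed distances, and is not RAGE's implementation. Treating each expression profile as a set of (time, level) points is an assumption made for illustration, as are the function names and sample values.

```python
import math


def directed_hausdorff(points_a, points_b):
    """Largest distance from any point in points_a to its nearest point in points_b."""
    return max(min(math.dist(p, q) for q in points_b) for p in points_a)


def hausdorff(points_a, points_b):
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(
        directed_hausdorff(points_a, points_b),
        directed_hausdorff(points_b, points_a),
    )


# Hypothetical profiles as (time, expression level) samples.
profile_a = [(0.0, 0.0), (1.0, 0.0)]
profile_b = [(0.0, 0.0), (1.0, 0.0), (1.0, 3.0)]

distance_ab = hausdorff(profile_a, profile_b)
distance_ba = hausdorff(profile_b, profile_a)
```

Because the measure is symmetric and satisfies the triangle inequality, it can be used directly as the dissimilarity in a clustering step. This brute-force version takes time proportional to the product of the two set sizes.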


Subjects
Computational Biology/methods , Data Interpretation, Statistical , Gene Expression Profiling/methods , Algorithms , Animals , CDC28 Protein Kinase, S cerevisiae/genetics , Cell Cycle Proteins/genetics , Circadian Rhythm/genetics , Drosophila/genetics , GTP-Binding Proteins/genetics , Genome , Humans , Oligonucleotide Array Sequence Analysis/statistics & numerical data
20.
Article in English | MEDLINE | ID: mdl-16452795

ABSTRACT

Recognition of a protein's fold provides valuable information about its function. While many sequence-based homology prediction methods exist, an important challenge remains: two highly dissimilar sequences can have similar folds. How can we detect this rapidly, in the context of structural genomics? High-throughput NMR experiments, coupled with novel algorithms for data analysis, can address this challenge. We report an automated procedure for detecting 3D structural homologies from sparse, unassigned protein NMR data. Our method identifies the 3D structural models in a protein structural database whose geometries best fit the unassigned experimental NMR data. It does not use sequence information and is thus not limited by sequence homology. The method can also be used to confirm or refute structural predictions made by other techniques such as protein threading or sequence homology. The algorithm runs in O(pnk^3) time, where p is the number of proteins in the database, n is the number of residues in the target protein, and k is the resolution of a rotation search. The method requires only uniform ^15N-labelling of the protein and processes unassigned H^N-^15N residual dipolar couplings, which can be acquired in a couple of hours. Our experiments on NMR data from 5 different proteins demonstrate that the method identifies closely related protein folds, despite low sequence homology between the target protein and the computed model.


Subjects
Algorithms , Artificial Intelligence , Magnetic Resonance Spectroscopy/methods , Peptide Mapping/methods , Proteins/analysis , Proteins/chemistry , Sequence Analysis, Protein/methods , Binding Sites , Databases, Protein , Pattern Recognition, Automated/methods , Protein Binding , Protein Conformation , Sequence Homology, Amino Acid
...