1.
Comput Struct Biotechnol J ; 20: 1702-1715, 2022.
Article in English | MEDLINE | ID: mdl-35495120

ABSTRACT

SPARC facilitates the generation of plausible hypotheses regarding underlying biochemical mechanisms by structurally characterizing protein sequence constraints. Such constraints appear as residues co-conserved in functionally related subgroups, as subtle pairwise correlations (i.e., direct couplings), and as correlations among these sequence features or with structural features. SPARC performs three types of analyses. First, based on pairwise sequence correlations, it estimates the biological relevance of alternative conformations and of homomeric contacts, as illustrated here for death domains. Second, it estimates the statistical significance of the correspondence between directly coupled residue pairs and interactions at heterodimeric interfaces. Third, given molecular dynamics simulated structures, it characterizes interactions among constrained residues or between such residues and ligands that: (a) are stably maintained during the simulation; (b) undergo correlated formation and/or disruption of interactions with other constrained residues; or (c) switch between alternative interactions. We illustrate this for two homohexameric complexes: the bacterial enhancer binding protein (bEBP) NtrC1, which activates transcription by remodeling RNA polymerase (RNAP) containing σ54, and for DnaB helicase, which opens DNA at the bacterial replication fork. Based on the NtrC1 analysis, we hypothesize possible mechanisms for inhibiting ATP hydrolysis until ADP is released from an adjacent subunit and for coupling ATP hydrolysis to restructuring of σ54 binding loops. Based on the DnaB analysis, we hypothesize that DnaB 'grabs' ssDNA by flipping every fourth base and inserting it into cavities between subunits and that flipping of a DnaB-specific glutamine residue triggers ATP hydrolysis.

2.
Int J Mol Sci ; 23(6), 2022 Mar 11.
Article in English | MEDLINE | ID: mdl-35328445

ABSTRACT

Semaphorin 4A (Sema4A) exerts a stabilizing effect on human Treg cells in PBMC and CD4+ T cell cultures by engaging Plexin B1. Sema4A deficient mice display enhanced allergic airway inflammation accompanied by fewer Treg cells, while Sema4D deficient mice displayed reduced inflammation and increased Treg cell numbers even though both Sema4 subfamily members engage Plexin B1. The main objectives of this study were: 1. To compare the in vitro effects of Sema4A and Sema4D proteins on human Treg cells; and 2. To identify function-determining residues in Sema4A critical for binding to Plexin B1 based on Sema4D homology modeling. We report here that Sema4A and Sema4D display opposite effects on human Treg cells in in vitro PBMC cultures; Sema4D inhibited the CD4+CD25+Foxp3+ cell numbers and CD25/Foxp3 expression. Sema4A and Sema4D competitively bind to Plexin B1 in vitro and hence may be doing so in vivo as well. Bayesian Partitioning with Pattern Selection (BPPS) partitioned 4505 Sema domains from diverse organisms into subgroups based on distinguishing sequence patterns that are likely responsible for functional differences. BPPS groups Sema3 and Sema4 into one family and further separates Sema4A and Sema4D into distinct subfamilies. Residues distinctive of the Sema3,4 family and of Sema4A (and by homology of Sema4D) tend to cluster around the Plexin B1 binding site. This suggests that the residues both common to and distinctive of Sema4A and Sema4D may mediate binding to Plexin B1, with subfamily residues mediating functional specificity. We mutated the Sema4A-specific residues M198 and F223 to alanine; notably, F223 in Sema4A corresponds to alanine in Sema4D. Mutant proteins were assayed for Plexin B1-binding and Treg stimulation activities. The F223A mutant was unable to stimulate Treg stability in in vitro PBMC cultures despite binding Plexin B1 with an affinity similar to the WT protein. 
This research is a first step in generating potent mutant Sema4A molecules with stimulatory function for Treg cells with a view to designing immunotherapeutics for asthma.


Subjects
Mononuclear Leukocytes , Semaphorins/metabolism , Alanine , Animals , Bayes Theorem , Forkhead Transcription Factors/genetics , Humans , Inflammation , Mononuclear Leukocytes/metabolism , Mice , Nerve Tissue Proteins/metabolism
4.
Sci Rep ; 11(1): 17663, 2021 09 03.
Article in English | MEDLINE | ID: mdl-34480063

ABSTRACT

De novo transcriptome assembly from billions of RNA-seq reads is very challenging due to alternative splicing and various levels of expression, which often leads to incorrect, mis-assembled transcripts. BayesDenovo addresses this problem by using both a read-guided strategy to accurately reconstruct splicing graphs from the RNA-seq data and a Bayesian strategy to estimate, from these graphs, the probability of transcript expression without penalizing poorly expressed transcripts. Simulation and cell line benchmark studies demonstrate that BayesDenovo is very effective in reducing false positives and achieves much higher accuracy than other assemblers, especially for alternatively spliced genes and for highly or poorly expressed transcripts. Moreover, BayesDenovo is more robust on multiple replicates by assembling a larger portion of common transcripts. When applied to breast cancer data, BayesDenovo identifies phenotype-specific transcripts associated with breast cancer recurrence.


Subjects
Gene Expression Profiling/methods , High-Throughput Nucleotide Sequencing , Transcriptome , Bayes Theorem , Computer Simulation , Humans , RNA Sequence Analysis
5.
PLoS Comput Biol ; 17(7): e1009203, 2021 07.
Article in English | MEDLINE | ID: mdl-34292930

ABSTRACT

Transcription factors (TFs) often function as a module including both master factors and mediators binding at cis-regulatory regions to modulate nearby gene transcription. ChIP-seq profiling of multiple TFs makes it feasible to infer functional TF modules. However, when inferring TF modules based on co-localization of ChIP-seq peaks, often many weak binding events are missed, especially for mediators, resulting in incomplete identification of modules. To address this problem, we develop a ChIP-seq data-driven Gibbs Sampler to infer Modules (ChIP-GSM) using a Bayesian framework that integrates ChIP-seq profiles of multiple TFs. ChIP-GSM samples read counts of module TFs iteratively to estimate the binding potential of a module to each region and, across all regions, estimates the module abundance. Using inferred module-region probabilistic bindings as feature units, ChIP-GSM then employs logistic regression to predict active regulatory elements. Validation of ChIP-GSM predicted regulatory regions on multiple independent datasets sharing the same context confirms the advantage of using TF modules for predicting regulatory activity. In a case study of K562 cells, we demonstrate that the ChIP-GSM inferred modules form as groups, activate gene expression at different time points, and mediate diverse functional cellular processes. Hence, ChIP-GSM infers biologically meaningful TF modules and improves the prediction accuracy of regulatory region activities.


Subjects
Chromatin Immunoprecipitation Sequencing/methods , Gene Regulatory Networks , Nucleic Acid Regulatory Sequences/genetics , Transcription Factors/genetics , Transcription Factors/metabolism , Bayes Theorem , Binding Sites/genetics , Chromatin/genetics , Chromatin/metabolism , Chromatin Immunoprecipitation Sequencing/statistics & numerical data , Computational Biology , Genetic Enhancer Elements , Genetic Epigenesis , Gene Expression Regulation , Humans , K562 Cells , MCF-7 Cells , Statistical Models , Genetic Promoter Regions
6.
Bioinformatics ; 37(20): 3456-3463, 2021 Oct 25.
Article in English | MEDLINE | ID: mdl-33983436

ABSTRACT

MOTIVATION: Detecting subtle biologically relevant patterns in protein sequences often requires the construction of a large and accurate multiple sequence alignment (MSA). Methods for constructing MSAs are usually evaluated using benchmark alignments, which, however, typically contain very few sequences and are therefore inappropriate when dealing with large numbers of proteins. RESULTS: eCOMPASS addresses this problem using a statistical measure of relative alignment quality based on direct coupling analysis (DCA): to maintain protein structural integrity over evolutionary time, substitutions at one residue position typically result in compensating substitutions at other positions. eCOMPASS computes the statistical significance of the congruence between high scoring directly coupled pairs and 3D contacts in corresponding structures, which depends upon properly aligned homologous residues. We illustrate eCOMPASS using both simulated and real MSAs. AVAILABILITY AND IMPLEMENTATION: The eCOMPASS executable, C++ open source code and input data sets are available at https://www.igs.umaryland.edu/labs/neuwald/software/compass. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
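
The covariation idea that DCA builds on can be illustrated with a much simpler statistic. The sketch below computes the mutual information between two columns of a toy alignment. It is not DCA and not the eCOMPASS statistic; the function name and the four-sequence alignment are hypothetical and serve only to show what "compensating substitutions" look like numerically.

```python
from collections import Counter
from math import log2


def column_mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j.

    `msa` is a list of equal-length aligned sequences. This is a plain
    covariation measure used for illustration, not a direct-coupling score.
    """
    n = len(msa)
    count_i = Counter(seq[i] for seq in msa)
    count_j = Counter(seq[j] for seq in msa)
    count_ij = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        mi += p_ab * log2(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi


# Toy alignment: columns 0 and 1 always change together; column 2 varies freely.
toy_msa = ["AKL", "AKV", "GEL", "GEV"]
print(column_mutual_information(toy_msa, 0, 1))  # covarying pair
print(column_mutual_information(toy_msa, 0, 2))  # independent pair
```

Mutual information cannot separate direct from indirect couplings, which is one reason DCA-style models are preferred for contact prediction.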

7.
BMC Bioinformatics ; 22(1): 193, 2021 Apr 15.
Article in English | MEDLINE | ID: mdl-33858322

ABSTRACT

BACKGROUND: ChIP-seq combines chromatin immunoprecipitation assays with sequencing and identifies genome-wide binding sites for DNA binding proteins. While many binding sites have strong ChIP-seq 'peak' observations and are well captured, there are still regions bound by proteins weakly, with a relatively low ChIP-seq signal enrichment. These weak binding sites, especially those at promoters and enhancers, are functionally important because they also regulate nearby gene expression. Yet, it remains a challenge to accurately identify weak binding sites in ChIP-seq data due to the ambiguity in differentiating these weak binding sites from the amplified background DNAs. RESULTS: ChIP-BIT2 ( http://sourceforge.net/projects/chipbitc/ ) is a software package for ChIP-seq peak detection. ChIP-BIT2 employs a mixture model integrating protein and control ChIP-seq data and predicts strong or weak protein binding sites at promoters, enhancers, or other genomic locations. For binding sites at gene promoters, ChIP-BIT2 simultaneously predicts their target genes. ChIP-BIT2 has been validated on benchmark regions and tested using large-scale ENCODE ChIP-seq data, demonstrating its high accuracy and wide applicability. CONCLUSION: ChIP-BIT2 is an efficient ChIP-seq peak caller. It provides a better lens to examine weak binding sites and can refine or extend the existing binding site collection, providing additional regulatory regions for decoding the mechanism of gene expression regulation.


Subjects
High-Throughput Nucleotide Sequencing , Software , Bayes Theorem , Binding Sites , Chromatin Immunoprecipitation , Oligonucleotide Array Sequence Analysis , DNA Sequence Analysis
8.
Sci Rep ; 11(1): 385, 2021 01 11.
Article in English | MEDLINE | ID: mdl-33432018

ABSTRACT

Exploring complex modularization of intracellular signal transduction pathways is critical to understanding aberrant cellular responses during disease development and drug treatment. IMPALA (Inferred Modularization of PAthway LAndscapes) integrates information from high throughput gene expression experiments and genome-scale knowledge databases to identify aberrant pathway modules, thereby providing a powerful sampling strategy to reconstruct and explore pathway landscapes. Here IMPALA identifies pathway modules associated with breast cancer recurrence and Tamoxifen resistance. Focusing on estrogen-receptor (ER) signaling, IMPALA identifies alternative pathways from gene expression data of Tamoxifen treated ER positive breast cancer patient samples. These pathways were often interconnected through cytoplasmic genes such as IRS1/2, JAK1, YWHAZ, CSNK2A1, MAPK1 and HSP90AA1 and significantly enriched with ErbB, MAPK, and JAK-STAT signaling components. Characterization of the pathway landscape revealed key modules associated with ER signaling and with cell cycle and apoptosis signaling. We validated IMPALA-identified pathway modules using data from four different breast cancer cell lines including sensitive and resistant models to Tamoxifen. Results showed that a majority of genes in cell cycle/apoptosis modules that were up-regulated in breast cancer patients with short survivals (< 5 years) were also over-expressed in drug resistant cell lines, whereas the transcription factors JUN, FOS, and STAT3 were down-regulated in both patient and drug resistant cell lines. Hence, IMPALA identified pathways were associated with Tamoxifen resistance and an increased risk of breast cancer recurrence. The IMPALA package is available at https://dlrl.ece.vt.edu/software/ .


Subjects
Breast Neoplasms/pathology , Computational Biology , Local Neoplasm Recurrence/genetics , Algorithms , Breast Neoplasms/drug therapy , Breast Neoplasms/genetics , Breast Neoplasms/metabolism , Antineoplastic Drug Resistance/genetics , Female , Neoplastic Gene Expression Regulation , Gene Regulatory Networks/physiology , BRCA1 Genes , Humans , Neoplasm Metastasis , Local Neoplasm Recurrence/metabolism , ErbB-2 Receptor/genetics , ErbB-2 Receptor/metabolism , Estrogen Receptors/genetics , Estrogen Receptors/metabolism , Signal Transduction/genetics , Tamoxifen/pharmacology , Tamoxifen/therapeutic use
9.
Bioinformatics ; 37(5): 650-658, 2021 05 05.
Article in English | MEDLINE | ID: mdl-33016988

ABSTRACT

MOTIVATION: High-throughput RNA sequencing has revolutionized the scope and depth of transcriptome analysis. Accurate reconstruction of a phenotype-specific transcriptome is challenging due to the noise and variability of RNA-seq data. This requires computational identification of transcripts from multiple samples of the same phenotype, given the underlying consensus transcript structure. RESULTS: We present a Bayesian method, integrated assembly of phenotype-specific transcripts (IntAPT), that identifies phenotype-specific isoforms from multiple RNA-seq profiles. IntAPT features a novel two-layer Bayesian model to capture the presence of isoforms at the group layer and to quantify the abundance of isoforms at the sample layer. A spike-and-slab prior is used to model the isoform expression and to enforce the sparsity of expressed isoforms. Dependencies between the existence of isoforms and their expression are modeled explicitly to facilitate parameter estimation. Model parameters are estimated iteratively using Gibbs sampling to infer the joint posterior distribution, from which the presence and abundance of isoforms can reliably be determined. Studies using both simulations and real datasets show that IntAPT consistently outperforms existing methods. Experimental results demonstrate that, despite sequencing errors, IntAPT exhibits a robust performance among multiple samples, resulting in notably improved identification of expressed isoforms of low abundance. AVAILABILITY AND IMPLEMENTATION: The IntAPT package is available at http://github.com/henryxushi/IntAPT. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
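
To make the spike-and-slab idea concrete, here is a minimal sketch of drawing isoform expression values from such a prior: each isoform is either absent (the spike at exactly zero) or present with a positive expression level (the slab). The exponential slab, the parameter names, and the function itself are assumptions chosen for illustration; the abstract does not state which slab distribution IntAPT uses.

```python
import random


def sample_spike_and_slab(n_isoforms, p_present, slab_mean, rng):
    """Draw expression values from a spike-and-slab prior.

    With probability `p_present` an isoform is expressed and its level is
    drawn from the slab (here an exponential with mean `slab_mean`, an
    illustrative choice); otherwise its level is exactly zero (the spike).
    """
    draws = []
    for _ in range(n_isoforms):
        present = rng.random() < p_present
        draws.append(rng.expovariate(1.0 / slab_mean) if present else 0.0)
    return draws


rng = random.Random(0)  # fixed seed so the sketch is repeatable
levels = sample_spike_and_slab(n_isoforms=10, p_present=0.3, slab_mean=5.0, rng=rng)
print(sum(1 for x in levels if x > 0), "of", len(levels), "isoforms expressed")
```

Because the spike puts real probability mass on zero, a posterior built on this prior can switch an isoform off entirely rather than merely shrinking its estimate, which is how sparsity is enforced.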


Subjects
Gene Expression Profiling , Transcriptome , Bayes Theorem , Phenotype , RNA-Seq , RNA Sequence Analysis , Software
10.
Sci Rep ; 10(1): 16962, 2020 Oct 07.
Article in English | MEDLINE | ID: mdl-33028952

ABSTRACT

An amendment to this paper has been published and can be accessed via a link at the top of the paper.

11.
Database (Oxford) ; 2020, 2020 01 01.
Article in English | MEDLINE | ID: mdl-32500917

ABSTRACT

For optimal performance, machine learning methods for protein sequence/structural analysis typically require as input a large multiple sequence alignment (MSA), which is often created using query-based iterative programs, such as PSI-BLAST or JackHMMER. However, because these programs align database sequences using a query sequence as a template, they may fail to detect or may tend to misalign sequences distantly related to the query. More generally, automated MSA programs often fail to align sequences correctly due to the unpredictable nature of protein evolution. Addressing this problem typically requires manual curation in the light of structural data. However, curated MSAs tend to contain too few sequences to serve as input for statistically based methods. We address these shortcomings by making publicly available a set of 252 curated hierarchical MSAs (hiMSAs), containing a total of 26 212 066 sequences, along with programs for generating extremely large MSAs from these hiMSAs. Each hiMSA consists of a set of hierarchically arranged MSAs representing individual subgroups within a superfamily along with template MSAs specifying how to align each subgroup MSA against MSAs higher up the hierarchy. Central to this approach is the MAPGAPS search program, which uses a hiMSA as a query to align (potentially vast numbers of) matching database sequences with accuracy comparable to that of the curated hiMSA. We illustrate this process for the exonuclease-endonuclease-phosphatase superfamily and for pleckstrin homology domains. A set of extremely large MSAs generated from the hiMSAs in this way is available as input for deep learning, big data analyses. MAPGAPS, auxiliary programs CDD2MGS, AddPhylum, PurgeMSA and ConvertMSA and links to National Center for Biotechnology Information data files are available at https://www.igs.umaryland.edu/labs/neuwald/software/mapgaps/.


Subjects
Protein Databases , Proteins , Sequence Alignment/methods , Machine Learning , Proteins/chemistry , Proteins/genetics , Protein Sequence Analysis , Software
12.
Sci Rep ; 10(1): 7960, 2020 05 14.
Article in English | MEDLINE | ID: mdl-32409786

ABSTRACT

Genome-wide transcription factor (TF) binding signal analyses reveal co-localization of TF binding sites based on inferred cis-regulatory modules (CRMs). CRMs play a key role in understanding the cooperation of multiple TFs under specific conditions. However, the functions of CRMs and their effects on nearby gene transcription are highly dynamic and context-specific and therefore are challenging to characterize. BICORN (Bayesian Inference of COoperative Regulatory Network) builds a hierarchical Bayesian model and infers context-specific CRMs based on TF-gene binding events and gene expression data for a particular cell type. BICORN automatically searches for a list of candidate CRMs based on the input TF bindings at regulatory regions associated with genes of interest. Applying Gibbs sampling, BICORN iteratively estimates model parameters of CRMs, TF activities, and corresponding regulation on gene transcription, which it models as a sparse network of functional CRMs regulating target genes. The BICORN package is implemented in R (version 3.4 or later) and is publicly available on the CRAN server at https://cran.r-project.org/web/packages/BICORN/index.html.


Subjects
Computational Biology/methods , Gene Regulatory Networks , Nucleic Acid Regulatory Sequences/genetics , Bayes Theorem , Cell Line , Humans , Software
13.
Immunogenetics ; 72(3): 181-203, 2020 04.
Article in English | MEDLINE | ID: mdl-32002590

ABSTRACT

Toll-interleukin-1R resistance (TIR) domains are ubiquitously present in all forms of cellular life. They are most commonly found in signaling proteins, as units responsible for signal-dependent formation of protein complexes that enable amplification and spatial propagation of the signal. A less common function of TIR domains is their ability to catalyze nicotinamide adenine dinucleotide degradation. This survey analyzes 26,414 TIR domains, automatically classified based on group-specific sequence patterns presumably determining biological function, using a statistical approach termed Bayesian partitioning with pattern selection (BPPS). We examine these groups and patterns in the light of available structures and biochemical analyses. Proteins within each of thirteen eukaryotic groups (10 metazoans and 3 plants) typically appear to perform similar functions, whereas proteins within each prokaryotic group typically exhibit diverse domain architectures, suggesting divergent functions. Groups are often uniquely characterized by structural fold variations associated with group-specific sequence patterns and by herein identified sequence motifs defining TIR domain functional divergence. For example, BPPS identifies, in helices C and D of TIRAP and MyD88 orthologs, conserved surface-exposed residues apparently responsible for specificity of TIR domain interactions. In addition, BPPS clarifies the functional significance of the previously described Box 2 and Box 3 motifs, each of which is a part of a larger, group-specific block of conserved, intramolecularly interacting residues.


Subjects
Signal Transducing Adaptor Proteins/genetics , Protein Domains/genetics , Protein Domains/physiology , Signal Transducing Adaptor Proteins/metabolism , Amino Acid Sequence , Animals , Bayes Theorem , Genetic Databases , Drosophila Proteins/genetics , Drosophila Proteins/metabolism , Humans , Interleukins , Molecular Models , Myeloid Differentiation Factor 88/genetics , Myeloid Differentiation Factor 88/metabolism , Protein Secondary Structure , Interleukin-1 Receptors/genetics , Interleukin-1 Receptors/metabolism , Signal Transduction/genetics , Signal Transduction/physiology , Toll-Like Receptors/genetics , Toll-Like Receptors/metabolism
14.
Sci Rep ; 10(1): 1691, 2020 02 03.
Article in English | MEDLINE | ID: mdl-32015389

ABSTRACT

Protein functional constraints are manifest as superfamily and functional-subgroup conserved residues, and as pairwise correlations. Deep Analysis of Residue Constraints (DARC) aids the visualization of these constraints, characterizes how they correlate with each other and with structure, and estimates statistical significance. This can identify determinants of protein functional specificity, as we illustrate for bacterial DNA clamp loader ATPases. These load ring-shaped sliding clamps onto DNA to keep polymerase attached during replication and contain one δ, three γ, and one δ' AAA+ subunits semi-circularly arranged in the order δ-γ1-γ2-γ3-δ'. Only γ is active, though both γ and δ' functionally influence an adjacent γ subunit. DARC identifies, as functionally-congruent features linking allosterically the ATP, DNA, and clamp binding sites: residues distinctive of γ and of γ/δ' that mutually interact in trans, centered on the catalytic base; several γ/δ'-residues and six γ/δ'-covariant residue pairs within the DNA binding N-termini of helices α2 and α3; and γ/δ'-residues associated with the α2 C-terminus and the clamp-binding loop. Most notable is a trans-acting γ/δ' hydroxyl group that 99% of other AAA+ proteins lack. Mutation of this hydroxyl to a methyl group impedes clamp binding and opening, DNA binding, and ATP hydrolysis, implying a remarkably clamp-loader-specific function.


Subjects
DNA-Binding Proteins/metabolism , Protein Subunits/metabolism , Adenosine Triphosphatases/metabolism , Adenosine Triphosphate/metabolism , Binding Sites/physiology , DNA Polymerase III/metabolism , Bacterial DNA/metabolism , Escherichia coli/metabolism , Hydrolysis , Protein Secondary Structure , Sensitivity and Specificity
15.
Elife ; 7, 2018 01 16.
Article in English | MEDLINE | ID: mdl-29336305

ABSTRACT

Residues responsible for allostery, cooperativity, and other subtle but functionally important interactions remain difficult to detect. To aid such detection, we employ statistical inference based on the assumption that residues distinguishing a protein subgroup from evolutionarily divergent subgroups often constitute an interacting functional network. We identify such networks with the aid of two measures of statistical significance. One measure aids identification of divergent subgroups based on distinguishing residue patterns. For each subgroup, a second measure identifies structural interactions involving pattern residues. Such interactions are derived either from atomic coordinates or from Direct Coupling Analysis scores, used as surrogates for structural distances. Applying this approach to N-acetyltransferases, P-loop GTPases, RNA helicases, synaptojanin-superfamily phosphatases and nucleases, and thymine/uracil DNA glycosylases yielded results congruent with biochemical understanding of these proteins, and also revealed striking sequence-structural features overlooked by other methods. These and similar analyses can aid the design of drugs targeting allosteric sites.


Subjects
Computational Biology/methods , Enzymes/chemistry , Enzymes/metabolism , Protein Conformation
16.
J Comput Biol ; 25(2): 121-129, 2018 02.
Article in English | MEDLINE | ID: mdl-28771374

ABSTRACT

We study a simple abstract problem motivated by a variety of applications in protein sequence analysis. Consider a string of 0s and 1s of length L, and containing D 1s. If we believe that some or all of the 1s may be clustered near the start of the sequence, which subset is the most significantly so clustered, and how significant is this clustering? We approach this question using the minimum description length principle and illustrate its application by analyzing residues that distinguish translational initiation and elongation factor guanosine triphosphatases (GTPases) from other P-loop GTPases. Within a structure of yeast elongation factor 1[Formula: see text], these residues form a significant cluster centered on a region implicated in guanine nucleotide exchange. Various biomedical questions may be cast as the abstract problem considered here.
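
The abstract problem can be made concrete with a small two-part-code sketch. Under a null model, describing which D of the L positions hold 1s costs log2 C(L, D) bits. If the 1s crowd the start, it can be cheaper to name a cut point, say how many 1s precede it, and then encode the prefix and suffix separately. The overhead terms and the function name below are assumptions made for illustration; the paper's actual MDL formulation may differ.

```python
from math import comb, log2


def best_prefix_cluster(bits):
    """Find the prefix whose separate encoding saves the most bits.

    Returns (cut, ones_in_prefix, bits_saved); (0, 0, 0.0) means no cut
    beats the null one-part code. Illustrative two-part code only.
    """
    L, D = len(bits), sum(bits)
    null_cost = log2(comb(L, D))       # which D of the L positions are 1s
    overhead = log2(L) + log2(D + 1)   # assumed cost of naming cut and prefix count
    best = (0, 0, 0.0)
    k = 0
    for n in range(1, L):
        k += bits[n - 1]               # number of 1s among the first n positions
        cost = overhead + log2(comb(n, k)) + log2(comb(L - n, D - k))
        saved = null_cost - cost
        if saved > best[2]:
            best = (n, k, saved)
    return best


clustered = [1] * 5 + [0] * 15
cut, ones, saved = best_prefix_cluster(clustered)
print(cut, ones, round(saved, 2))
```

A positive number of saved bits s corresponds roughly to a significance bound of 2^-s, which is what lets a description-length comparison double as a statistical test.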


Subjects
Computational Biology/methods , GTP Phosphohydrolase-Linked Elongation Factors/chemistry , Saccharomyces cerevisiae Proteins/chemistry , Protein Sequence Analysis/methods , Cluster Analysis
17.
PLoS Comput Biol ; 14(12): e1006237, 2018 12.
Article in English | MEDLINE | ID: mdl-30596639

ABSTRACT

Protein Direct Coupling Analysis (DCA), which predicts residue-residue contacts based on covarying positions within a multiple sequence alignment, has been remarkably effective. This suggests that there is more to learn from sequence correlations than is generally assumed, and calls for deeper investigations into DCA and perhaps into other types of correlations. Here we describe an approach that enables such investigations by measuring, as an estimated p-value, the statistical significance of the association between residue-residue covariance and structural interactions, either internal or homodimeric. Its application to thirty protein superfamilies confirms that direct coupling (DC) scores correlate with 3D pairwise contacts with very high significance. This method also permits quantitative assessment of the relative performance of alternative DCA methods, and of the degree to which they detect direct versus indirect couplings. We illustrate its use to assess, for a given protein, the biological relevance of alternative conformational states, to investigate the possible mechanistic implications of differences between these states, and to characterize subtle aspects of direct couplings. Our analysis indicates that direct pairwise correlations may be largely distinct from correlated patterns associated with functional specialization, and that the joint analysis of both types of correlations can yield greater power. Data, programs, and source code are freely available at http://evaldca.igs.umaryland.edu.
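
One simple way to attach a p-value to the agreement between high-scoring coupled pairs and structural contacts is a hypergeometric tail: if the top-scoring pairs were chosen at random, how often would at least this many be contacts? The sketch below is a stand-in under that assumption and is not the statistic used by the authors; the function and parameter names are hypothetical.

```python
from math import comb


def overlap_p_value(n_pairs, n_contacts, n_top, n_hits):
    """Hypergeometric upper-tail probability P(X >= n_hits).

    n_pairs    -- all residue pairs considered
    n_contacts -- pairs that are in contact in the structure
    n_top      -- number of top-scoring coupled pairs examined
    n_hits     -- how many of those top pairs are contacts
    """
    total = comb(n_pairs, n_top)
    upper = min(n_top, n_contacts)
    tail = sum(
        comb(n_contacts, x) * comb(n_pairs - n_contacts, n_top - x)
        for x in range(n_hits, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 40 of the 50 top-scoring pairs are contacts,
# out of 4950 possible pairs of which 300 are contacts.
print(overlap_p_value(n_pairs=4950, n_contacts=300, n_top=50, n_hits=40))
```

A fixed top-N cutoff is arbitrary, so a more careful analysis would account for the choice of cutoff instead of testing a single one.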


Subjects
Binding Sites/physiology , Proteins/chemistry , Protein Sequence Analysis/methods , Algorithms , Molecular Models , Protein Conformation , Protein Interaction Domains and Motifs/physiology , Protein Structural Elements , Sequence Alignment/methods , Sequence Alignment/statistics & numerical data , Protein Sequence Analysis/statistics & numerical data
18.
PLoS Comput Biol ; 12(12): e1005294, 2016 12.
Article in English | MEDLINE | ID: mdl-28002465

ABSTRACT

Over evolutionary time, members of a superfamily of homologous proteins sharing a common structural core diverge into subgroups filling various functional niches. At the sequence level, such divergence appears as correlations that arise from residue patterns distinct to each subgroup. Such a superfamily may be viewed as a population of sequences corresponding to a complex, high-dimensional probability distribution. Here we model this distribution as hierarchical interrelated hidden Markov models (hiHMMs), which describe these sequence correlations implicitly. By characterizing such correlations one may hope to obtain information regarding functionally-relevant properties that have thus far evaded detection. To do so, we infer a hiHMM distribution from sequence data using Bayes' theorem and Markov chain Monte Carlo (MCMC) sampling, which is widely recognized as the most effective approach for characterizing a complex, high dimensional distribution. Other routines then map correlated residue patterns to available structures with a view to hypothesis generation. When applied to N-acetyltransferases, this reveals sequence and structural features indicative of functionally important, yet generally unknown biochemical properties. Even for sets of proteins for which nothing is known beyond unannotated sequences and structures, this can lead to helpful insights. We describe, for example, a putative coenzyme-A-induced-fit substrate binding mechanism mediated by arginine residue switching between salt bridge and π-π stacking interactions. A suite of programs implementing this approach is available (psed.igs.umaryland.edu).


Subjects
Acetyltransferases/chemistry , Molecular Models , Protein Sequence Analysis/methods , Acetyltransferases/genetics , Acetyltransferases/metabolism , Amino Acid Sequence , Animals , Caenorhabditis elegans Proteins/chemistry , Caenorhabditis elegans Proteins/genetics , Caenorhabditis elegans Proteins/metabolism , Computational Biology , Humans , Markov Chains , Monte Carlo Method , Sequence Alignment/methods
19.
Curr Opin Struct Biol ; 38: 1-8, 2016 06.
Article in English | MEDLINE | ID: mdl-27179293

ABSTRACT

The availability of vast amounts of protein sequence data facilitates detection of subtle statistical correlations due to imposed structural and functional constraints. Recent breakthroughs using Direct Coupling Analysis (DCA) and related approaches have tapped into correlations believed to be due to compensatory mutations. This has yielded some remarkable results, including substantially improved prediction of protein intra- and inter-domain 3D contacts, of membrane and globular protein structures, of substrate binding sites, and of protein conformational heterogeneity. A complementary approach is Bayesian Partitioning with Pattern Selection (BPPS), which partitions related proteins into hierarchically-arranged subgroups based on correlated residue patterns. These correlated patterns are presumably due to structural and functional constraints associated with evolutionary divergence rather than to compensatory mutations. Hence joint application of DCA- and BPPS-based approaches should help sort out the structural and functional constraints contributing to sequence correlations.


Subjects
Computational Biology/methods , Proteins/chemistry , Proteins/metabolism , Sequence Alignment , Amino Acid Sequence , Interatrial Block , Molecular Models
20.
PLoS Comput Biol ; 12(5): e1004936, 2016 05.
Article in English | MEDLINE | ID: mdl-27192614

ABSTRACT

We describe a Bayesian Markov chain Monte Carlo (MCMC) sampler for protein multiple sequence alignment (MSA) that, as implemented in the program GISMO and applied to large numbers of diverse sequences, is more accurate than the popular MSA programs MUSCLE, MAFFT, Clustal-Ω and Kalign. Features of GISMO central to its performance are: (i) It employs a "top-down" strategy with a favorable asymptotic time complexity that first identifies regions generally shared by all the input sequences, and then realigns closely related subgroups in tandem. (ii) It infers position-specific gap penalties that favor insertions or deletions (indels) within each sequence at alignment positions in which indels are invoked in other sequences. This favors the placement of insertions between conserved blocks, which can be understood as making up the proteins' structural core. (iii) It uses a Bayesian statistical measure of alignment quality based on the minimum description length principle and on Dirichlet mixture priors. Consequently, GISMO aligns sequence regions only when statistically justified. This is unlike methods based on the ad hoc, but widely used, sum-of-the-pairs scoring system, which will align random sequences. (iv) It defines a system for exploring alignment space that provides natural avenues for further experimentation through the development of new sampling strategies for more efficiently escaping from suboptimal traps. GISMO's superior performance is illustrated using 408 protein sets containing, on average, 235 sequences. These sets correspond to NCBI Conserved Domain Database alignments, which have been manually curated in the light of available crystal structures, and thus provide a means to assess alignment accuracy. GISMO fills a different niche than other MSA programs, namely identifying and aligning a conserved domain present within a large, diverse set of full length sequences. The GISMO program is available at http://gismo.igs.umaryland.edu/.


Subjects
Proteins/chemistry , Sequence Alignment/statistics & numerical data , Algorithms , Bayes Theorem , Computational Biology , Protein Databases , Markov Chains , Monte Carlo Method , Sequence Alignment/standards , Software