1.
Microbiome ; 2: 15, 2014.
Article in English | MEDLINE | ID: mdl-24910773

ABSTRACT

BACKGROUND: Experimental designs that take advantage of high-throughput sequencing to generate datasets include RNA sequencing (RNA-seq), chromatin immunoprecipitation sequencing (ChIP-seq), sequencing of 16S rRNA gene fragments, metagenomic analysis and selective growth experiments. In each case the underlying data are similar and are composed of counts of sequencing reads mapped to a large number of features in each sample. Despite this underlying similarity, the data analysis methods used for these experimental designs are all different, and do not translate across experiments. Alternative methods have been developed in the physical and geological sciences that treat similar data as compositions. Compositional data analysis methods transform the data to relative abundances with the result that the analyses are more robust and reproducible. RESULTS: Data from an in vitro selective growth experiment, an RNA-seq experiment and the Human Microbiome Project 16S rRNA gene abundance dataset were examined by ALDEx2, a compositional data analysis tool that uses Bayesian methods to infer technical and statistical error. The ALDEx2 approach is shown to be suitable for all three types of data: it correctly identifies both the direction and differential abundance of features in the differential growth experiment, it identifies a substantially similar set of differentially expressed genes in the RNA-seq dataset as the leading tools and it identifies as differential the taxa that distinguish the tongue dorsum and buccal mucosa in the Human Microbiome Project dataset. The design of ALDEx2 reduces the number of false positive identifications that result from datasets composed of many features in few samples. 
CONCLUSION: Statistical analysis of high-throughput sequencing datasets composed of per feature counts showed that the ALDEx2 R package is a simple and robust tool, which can be applied to RNA-seq, 16S rRNA gene sequencing and differential growth datasets, and by extension to other techniques that use a similar approach.
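
The compositional treatment of per-feature counts described above can be illustrated with the centered log-ratio (CLR) transform, a standard transformation in compositional data analysis. This is a minimal sketch, not the ALDEx2 implementation: the function name `clr`, the `pseudocount` parameter and the two example samples are assumptions made for illustration.

```python
import math

def clr(counts, pseudocount=0.0):
    """Centered log-ratio transform of one sample's per-feature read counts.

    Zero counts need a positive pseudocount (an assumed, illustrative way
    of handling them); with the default of 0.0 all counts must be positive.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    # Each value is the log-ratio of a feature to the sample's geometric mean.
    return [v - mean_log for v in logs]

# Two hypothetical samples with identical proportions but different depth.
shallow = [10, 20, 70]
deep = [100, 200, 700]

clr_shallow = clr(shallow)
clr_deep = clr(deep)
```

Because every value is a ratio to the sample's own geometric mean, `clr_shallow` and `clr_deep` agree up to floating-point error even though the sequencing depth differs tenfold.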

2.
PLoS One ; 8(7): e67019, 2013.
Article in English | MEDLINE | ID: mdl-23843979

ABSTRACT

Experimental variance is a major challenge when dealing with high-throughput sequencing data. This variance has several sources: sampling replication, technical replication, variability within biological conditions, and variability between biological conditions. The high per-sample cost of RNA-Seq often precludes the large number of experiments needed to partition observed variance into these categories as per standard ANOVA models. We show that the partitioning of within-condition to between-condition variation cannot reasonably be ignored, whether in single-organism RNA-Seq or in Meta-RNA-Seq experiments, and further find that commonly-used RNA-Seq analysis tools, as described in the literature, do not enforce the constraint that the sum of relative expression levels must be one, and thus report expression levels that are systematically distorted. These two factors lead to misleading inferences if not properly accommodated. As it is usually only the biological between-condition and within-condition differences that are of interest, we developed ALDEx, an ANOVA-like differential expression procedure, to identify genes with greater between- to within-condition differences. We show that the presence of differential expression and the magnitude of these comparative differences can be reasonably estimated with even very small sample sizes.
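
The sum-to-one constraint mentioned in this abstract can be made concrete with a small worked example. The sketch below uses hypothetical abundances and a helper named `closure` (both assumptions for illustration, not part of ALDEx) to show how a change in one gene alters the relative levels of genes whose absolute abundance did not change.

```python
def closure(counts):
    """Scale counts to proportions that sum to one."""
    total = sum(counts)
    return [c / total for c in counts]

# Hypothetical absolute abundances for three genes in two conditions;
# only the first gene truly changes between conditions.
condition_a = [100, 100, 100]
condition_b = [400, 100, 100]

prop_a = closure(condition_a)
prop_b = closure(condition_b)

# The second and third genes are unchanged in absolute terms, yet their
# proportions fall because all proportions must still sum to one.
unchanged_gene_shift = prop_b[1] - prop_a[1]
```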


Subject(s)
High-Throughput Nucleotide Sequencing/statistics & numerical data , Metagenome , RNA, Bacterial/analysis , RNA/analysis , Sequence Analysis, RNA/statistics & numerical data , Analysis of Variance , Bacillus cereus/genetics , Bacillus cereus/metabolism , Gardnerella vaginalis/genetics , Gardnerella vaginalis/metabolism , Gene Expression , Gene Expression Profiling , Humans , Kidney/cytology , Kidney/metabolism , Lactobacillus/genetics , Lactobacillus/metabolism , Liver/cytology , Liver/metabolism , Megasphaera/genetics , Megasphaera/metabolism , Prevotella/genetics , Prevotella/metabolism , Reproducibility of Results , Sample Size
3.
Microbiome ; 1(1): 12, 2013 Apr 12.
Article in English | MEDLINE | ID: mdl-24450540

ABSTRACT

BACKGROUND: Bacterial vaginosis (BV), the most common vaginal condition of reproductive-aged women, is associated with a highly diverse and heterogeneous microbiota. Here we present a proof-of-principle analysis of the function of the microbiota, using meta-RNA-seq to uncover genes and pathways that potentially differentiate healthy vaginal microbial communities from those in the dysbiotic state of BV. RESULTS: The predominant organism, Lactobacillus iners, was present in both conditions and showed a differing expression profile in BV compared to healthy. Despite its minimal genome, L. iners differentially expressed over 10% of its gene complement. Notably, in a BV environment L. iners increased expression of a cholesterol-dependent cytolysin, and of mucin and glycerol transport and related metabolic enzymes. Genes belonging to a CRISPR system were greatly upregulated, suggesting that bacteriophage influence the community. Reflective of L. iners, the bacterial community as a whole demonstrated a preference for glycogen and glycerol as carbon sources under BV conditions. The predicted end-products of metabolism under BV conditions include an abundance of succinate and other short-chain fatty acids, while healthy conditions are predicted to largely contain lactic acid. CONCLUSIONS: Our study underscores the importance of understanding the functional activity of the bacterial community in addition to characterizing the population structure when investigating the human microbiome.

5.
PLoS One ; 6(8): e23804, 2011.
Article in English | MEDLINE | ID: mdl-21887323

ABSTRACT

Homing endonucleases are site-specific DNA endonucleases that function as mobile genetic elements by introducing double-strand breaks or nicks at defined locations. Of the major families of homing endonucleases, the modular GIY-YIG endonucleases are least understood in terms of mechanism. The GIY-YIG homing endonuclease I-BmoI generates a double-strand break by sequential nicking reactions during which the single active site of the GIY-YIG nuclease domain must undergo a substantial reorganization. Here, we show that divalent metal ion plays a significant role in regulating the two independent nicking reactions by I-BmoI. Rate constant determination for each nicking reaction revealed that limiting divalent metal ion has a greater impact on the second strand than the first strand nicking reaction. We also show that substrate mutations within the I-BmoI cleavage site can modulate the first strand nicking reaction over a 314-fold range. Additionally, in-gel DNA footprinting with mutant substrates and modeling of an I-BmoI-substrate complex suggest that amino acid contacts to a critical GC-2 base pair are required to induce a bottom-strand distortion that likely directs conformational changes for reaction progress. Collectively, our data imply mechanistic roles for divalent metal ion and substrate bases, suggesting that divalent metal ion facilitates the re-positioning of the GIY-YIG nuclease domain between sequential nicking reactions.


Subject(s)
DNA Breaks, Single-Stranded/drug effects , Endodeoxyribonucleases/metabolism , Metals/pharmacology , Catalytic Domain , Cations, Divalent/pharmacology , DNA Footprinting , Kinetics
6.
PLoS One ; 5(10): e15406, 2010 Oct 26.
Article in English | MEDLINE | ID: mdl-21048977

ABSTRACT

We developed a low-cost, high-throughput microbiome profiling method that uses combinatorial sequence tags attached to PCR primers that amplify the rRNA V6 region. Amplified PCR products are sequenced using an Illumina paired-end protocol to generate millions of overlapping reads. Combinatorial sequence tagging can be used to examine hundreds of samples with far fewer primers than is required when sequence tags are incorporated at only a single end. The number of reads generated permitted saturating or near-saturating analysis of samples of the vaginal microbiome. The large number of reads allowed an in-depth analysis of errors, and we found that PCR-induced errors composed the vast majority of non-organism derived species variants, an observation that has significant implications for sequence clustering of similar high-throughput data. We show that the short reads are sufficient to assign organisms to the genus or species level in most cases. We suggest that this method will be useful for the deep sequencing of any short nucleotide region that is taxonomically informative; these include the V3, V5 regions of the bacterial 16S rRNA genes and the eukaryotic V9 region that is gaining popularity for sampling protist diversity.
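
The primer saving from combinatorial tagging follows from simple counting: with n forward tags and m reverse tags, n × m samples can be labelled using n + m tagged primers, whereas tagging at a single end needs one tagged primer per sample. The sketch below illustrates this with hypothetical tag sequences; the function name `tag_pairs` and the tags themselves are assumptions, not taken from the paper.

```python
from itertools import product

def tag_pairs(forward_tags, reverse_tags):
    """Return every (forward, reverse) tag combination; each can label one sample."""
    return list(product(forward_tags, reverse_tags))

# Hypothetical tag sequences, for illustration only.
forward = ["ACGT", "TGCA", "GATC"]
reverse = ["CCAA", "GGTT", "AACC", "TTGG"]

pairs = tag_pairs(forward, reverse)

# Tagged primers needed to label len(pairs) samples under each scheme.
primers_combinatorial = len(forward) + len(reverse)
primers_single_end = len(pairs)
```

The gap between the two counts widens quickly as the number of tags grows, which is why hundreds of samples become practical with a modest primer set.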


Subject(s)
Gene Expression Profiling , Microbiology , Polymerase Chain Reaction/methods , Base Sequence , DNA Primers , Species Specificity
7.
Algorithms Mol Biol ; 5: 35, 2010 Nov 08.
Article in English | MEDLINE | ID: mdl-21059250

ABSTRACT

BACKGROUND: Unigenic evolution is a large-scale mutagenesis experiment used to identify residues that are potentially important for protein function. Both currently-used methods for the analysis of unigenic evolution data analyze 'windows' of contiguous sites, a strategy that increases statistical power but incorrectly assumes that functionally-critical sites are contiguous. In addition, both methods require the questionable assumption of asymptotically-large sample size due to the presumption of approximate normality. RESULTS: We develop a novel approach, termed the Evidence of Selection (EoS), removing the assumption that functionally important sites are adjacent in sequence and explicitly modelling the effects of limited sample-size. Precise statistical derivations show that the EoS score can be easily interpreted as an expected log-odds-ratio between two competing hypotheses, namely, the hypothetical presence or absence of functional selection for a given site. Using the EoS score, we then develop selection criteria by which functionally-important yet non-adjacent sites can be identified. An approximate power analysis is also developed to estimate the reliability of inference given the data. We validate and demonstrate the practical utility of our method by analysis of the homing endonuclease I-BmoI, comparing our predictions with the results of existing methods. CONCLUSIONS: Our method is able both to assess the evidence of selection at individual amino acid sites and to estimate the reliability of those inferences. Experimental validation with I-BmoI proves its utility to identify functionally-important residues of poorly characterized proteins, demonstrating increased sensitivity over previous methods without loss of specificity. With the ability to guide the selection of precise experimental mutagenesis conditions, our method helps make unigenic analysis a more broadly applicable technique with which to probe protein function.
AVAILABILITY: Software to compute, plot, and summarize EoS data is available as an open-source package called 'unigenic' for the 'R' programming language at http://www.fernandes.org/txp/article/13/an-analytical-framework-for-unigenic-evolution.

8.
PLoS One ; 5(8): e12078, 2010 Aug 12.
Article in English | MEDLINE | ID: mdl-20711427

ABSTRACT

BACKGROUND: Women living with HIV and co-infected with bacterial vaginosis (BV) are at higher risk for transmitting HIV to a partner or newborn. It is poorly understood which bacterial communities constitute BV or the normal vaginal microbiota among this population and how the microbiota associated with BV responds to antibiotic treatment. METHODS AND FINDINGS: The vaginal microbiota of 132 HIV positive Tanzanian women, including 39 who received metronidazole treatment for BV, were profiled using Illumina to sequence the V6 region of the 16S rRNA gene. Of note, Gardnerella vaginalis and Lactobacillus iners were detected in each sample, constituting core members of the vaginal microbiota. Eight major clusters were detected with relatively uniform microbiota compositions. Two clusters dominated by L. iners or L. crispatus were strongly associated with a normal microbiota. The L. crispatus-dominated microbiota were associated with low pH, but when L. crispatus was not present, a large fraction of L. iners was required to predict a low pH. Four clusters were strongly associated with BV, and were dominated by Prevotella bivia, Lachnospiraceae, or a mixture of different species. Metronidazole treatment reduced the microbial diversity and perturbed the BV-associated microbiota, but rarely resulted in the establishment of a lactobacilli-dominated microbiota. CONCLUSIONS: Illumina-based microbial profiling enabled high-throughput analyses of microbial samples at a high phylogenetic resolution. The vaginal microbiota among women living with HIV in Sub-Saharan Africa constitutes several profiles associated with a normal microbiota or BV. Recurrence of BV frequently constitutes a different BV-associated profile than before antibiotic treatment.


Subject(s)
HIV Infections/microbiology , Metagenome/genetics , Vagina/microbiology , Adolescent , Adult , Anti-Bacterial Agents/pharmacology , Female , Gardnerella vaginalis/drug effects , Gardnerella vaginalis/genetics , HIV Infections/complications , Humans , Hydrogen-Ion Concentration , Lactobacillus/drug effects , Lactobacillus/genetics , Metagenome/drug effects , Metronidazole/pharmacology , Middle Aged , RNA, Bacterial/genetics , RNA, Ribosomal, 16S/genetics , Sequence Analysis, DNA , Vagina/chemistry , Vaginosis, Bacterial/complications , Vaginosis, Bacterial/diagnosis , Vaginosis, Bacterial/microbiology , Young Adult
9.
PLoS One ; 5(6): e11082, 2010 Jun 28.
Article in English | MEDLINE | ID: mdl-20596526

ABSTRACT

BACKGROUND: There is currently no way to verify the quality of a multiple sequence alignment that is independent of the assumptions used to build it. Sequence alignments are typically evaluated by a number of established criteria: sequence conservation, the number of aligned residues, the frequency of gaps, and the probable correct gap placement. Covariation analysis is used to find putatively important residue pairs in a sequence alignment. Different alignments of the same protein family give different results demonstrating that covariation depends on the quality of the sequence alignment. We thus hypothesized that current criteria are insufficient to build alignments for use with covariation analyses. METHODOLOGY/PRINCIPAL FINDINGS: We show that current criteria are insufficient to build alignments for use with covariation analyses as systematic sequence alignment errors are present even in hand-curated structure-based alignment datasets like those from the Conserved Domain Database. We show that current non-parametric covariation statistics are sensitive to sequence misalignments and that this sensitivity can be used to identify systematic alignment errors. We demonstrate that removing alignment errors due to 1) improper structure alignment, 2) the presence of paralogous sequences, and 3) partial or otherwise erroneous sequences, improves contact prediction by covariation analysis. Finally we describe two non-parametric covariation statistics that are less sensitive to sequence alignment errors than those described previously in the literature. CONCLUSIONS/SIGNIFICANCE: Protein alignments with errors lead to false positive and false negative conclusions (incorrect assignment of covariation and conservation, respectively). Covariation analysis can provide a verification step, independent of traditional criteria, to identify systematic misalignments in protein alignments. 
Two non-parametric statistics are shown to be somewhat insensitive to misalignment errors, providing increased confidence in contact prediction when analyzing alignments with erroneous regions because they emphasize pairwise covariation over group covariation.


Subject(s)
Proteins/chemistry , Sequence Alignment
10.
Bioinformatics ; 26(9): 1135-9, 2010 May 01.
Article in English | MEDLINE | ID: mdl-20236946

ABSTRACT

MOTIVATION: Mutual information (MI) is a quantity that measures the dependence between two arbitrary random variables and has been repeatedly used to solve a wide variety of bioinformatic problems. Recently, when attempting to quantify the effects of sampling variance on computed values of MI in proteins, we encountered striking differences among various novel estimates of MI. These differences revealed that estimating the 'true' value of MI is not a straightforward procedure, and minor variations of assumptions yielded remarkably different estimates. RESULTS: We describe four formally equivalent estimates of MI, three of which explicitly account for sampling variance, that yield non-equal values of MI given exact frequencies. These MI estimates are essentially non-predictive of each other, converging only in the limit of implausibly large datasets. Lastly, we show that all four estimates are biologically reasonable estimates of MI, despite their disparity, since each is actually the Kullback-Leibler divergence between random variables conditioned on equally plausible hypotheses. CONCLUSIONS: For sparse contingency tables of the type universally observed in protein coevolution studies, our results show that estimates of MI, and hence inferences about physical phenomena such as coevolution, are critically dependent on at least three prior assumptions. These assumptions are: (i) how observation counts relate to expected frequencies; (ii) the relationship between joint and marginal frequencies; and (iii) how non-observed categories are interpreted. In any biologically relevant data, these assumptions will affect the MI estimate as much as, or more than, the observed data, and are independent of uncertainty in frequency parameters.
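
The sensitivity of MI to how non-observed categories are treated can be seen with the ordinary plug-in estimate on a small sparse table. The sketch below is not one of the four estimators described in the paper; the function `mutual_information`, the `pseudocount` parameter and the example table are assumptions chosen only to show that a single modelling choice changes the result.

```python
import math

def mutual_information(table, pseudocount=0.0):
    """Plug-in mutual information (in nats) of a 2-D contingency table of counts."""
    cells = [[c + pseudocount for c in row] for row in table]
    total = sum(sum(row) for row in cells)
    row_p = [sum(row) / total for row in cells]
    col_p = [sum(col) / total for col in zip(*cells)]
    mi = 0.0
    for i, row in enumerate(cells):
        for j, c in enumerate(row):
            if c > 0:  # a cell with zero mass contributes nothing
                p = c / total
                mi += p * math.log(p / (row_p[i] * col_p[j]))
    return mi

# Hypothetical sparse table of residue-pair counts at two alignment columns.
sparse = [[5, 0, 0],
          [0, 4, 0],
          [0, 0, 3]]

mi_raw = mutual_information(sparse)                         # empty cells treated as impossible
mi_smoothed = mutual_information(sparse, pseudocount=1.0)   # empty cells treated as unseen
```

Treating the empty cells as merely unobserved rather than impossible lowers the estimate substantially on the same twelve observations.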


Subject(s)
Computational Biology/methods , Algorithms , Bayes Theorem , Databases, Protein , Escherichia coli/genetics , Models, Statistical , Models, Theoretical , Probability , Software
11.
Nucleic Acids Res ; 38(7): 2411-27, 2010 Apr.
Article in English | MEDLINE | ID: mdl-20061372

ABSTRACT

Insight into protein structure and function is best obtained through a synthesis of experimental, structural and bioinformatic data. Here, we outline a framework that we call MUSE (mutual information, unigenic evolution and structure-guided elucidation), which facilitated the identification of previously unknown residues that are relevant for function of the GIY-YIG homing endonuclease I-BmoI. Our approach synthesizes three types of data: mutual information analyses that identify co-evolving residues within the GIY-YIG catalytic domain; a unigenic evolution strategy that identifies hyper- and hypo-mutable residues of I-BmoI; and interpretation of the unigenic and co-evolution data using a homology model. In particular, we identify novel positions within the GIY-YIG domain as functionally important. Proof-of-principle experiments implicate the non-conserved I71 as functionally relevant, with an I71N mutant accumulating a nicked cleavage intermediate. Moreover, many additional positions within the catalytic, linker and C-terminal domains of I-BmoI were implicated as important for function. Our results represent a platform on which to pursue future studies of I-BmoI and other GIY-YIG-containing proteins, and demonstrate that MUSE can successfully identify novel functionally critical residues that would be ignored in a traditional structure-function analysis within an extensively studied small domain of approximately 90 amino acids.


Subject(s)
Computational Biology/methods , Endodeoxyribonucleases/chemistry , Amino Acids/chemistry , Catalytic Domain , DNA Cleavage , Data Interpretation, Statistical , Endodeoxyribonucleases/genetics , Endodeoxyribonucleases/metabolism , Evolution, Molecular , Models, Molecular , Mutagenesis, Site-Directed , Mutation , Protein Structure, Tertiary , Sequence Alignment
12.
Mol Biol Evol ; 27(5): 1181-91, 2010 May.
Article in English | MEDLINE | ID: mdl-20065119

ABSTRACT

We demonstrated that a pair of positions in phosphoglycerate kinase that score highly by three nonparametric covariation measures are important for function even though the positions can be occupied by aliphatic, aromatic, or charged residues. Examination of these pairs suggested that the majority of the covariation scores could be explained by within-clade conservation. However, an analysis of diversity showed that the conservation within clades of covarying pairs was indistinguishable from pairs of positions that do not covary, thus ruling out both clade conservation and extensive homoplasy as means to identify covarying positions. Mutagenesis showed that the residues in the covarying pair were epistatic, with the type of epistasis being dependent on the initial pair. The results show that nonconserved covarying positions that affect protein function can be identified with high precision.


Subject(s)
Conserved Sequence , Evolution, Molecular , Phylogeny , Sequence Homology, Amino Acid , Amino Acid Sequence , Amino Acids/genetics , Databases, Protein , Models, Genetic , Models, Molecular , Molecular Sequence Data , Mutagenesis/genetics , Mutant Proteins/metabolism , Phosphoglycerate Kinase/chemistry , Protein Denaturation , Saccharomyces cerevisiae/cytology , Saccharomyces cerevisiae/enzymology , Saccharomyces cerevisiae/growth & development , Sequence Alignment , Temperature
13.
BMC Genomics ; 9: 468, 2008 Oct 08.
Article in English | MEDLINE | ID: mdl-18842153

ABSTRACT

BACKGROUND: Pseudoautosomal regions (PAR1 and PAR2) in eutherians retain homologous regions between the X and Y chromosomes that play a critical role in the obligatory X-Y crossover during male meiosis. Genes that reside in the PAR1 are exceptional in that they are rich in repetitive sequences and undergo a very high rate of recombination. Remarkably, murine PAR1 homologs have translocated to various autosomes, reflecting the complex recombination history during the evolution of the mammalian X chromosome. RESULTS: We now report that the SNF2-type chromatin remodeling protein ATRX controls the expression of eutherian ancestral PAR1 genes that have translocated to autosomes in the mouse. In addition, we have identified two potentially novel mouse PAR1 orthologs. CONCLUSION: We propose that the ancestral PAR1 genes share a common epigenetic environment that allows ATRX to control their expression.


Subject(s)
Chromatin Assembly and Disassembly , DNA Helicases/genetics , Genome , Nuclear Proteins/genetics , Translocation, Genetic , Amino Acid Sequence , Animals , Cells, Cultured , Evolution, Molecular , Gene Deletion , Gene Expression Profiling , Humans , Mice , Molecular Sequence Data , Oligonucleotide Array Sequence Analysis , Phylogeny , Prosencephalon/growth & development , RNA/genetics , RNA Interference , Reverse Transcriptase Polymerase Chain Reaction , Sequence Alignment , X-linked Nuclear Protein
14.
Bioinformatics ; 24(19): 2177-83, 2008 Oct 01.
Article in English | MEDLINE | ID: mdl-18662926

ABSTRACT

MOTIVATION: In a nucleotide or amino acid sequence, not all sites evolve at the same rate, due to differing selective constraints at each site. Currently in computational molecular evolution, models incorporating rate heterogeneity always share two assumptions. First, the rate of evolution at each site is assumed to be independent of every other site. Second, the values of these rates are assumed to be drawn from a known prior distribution. Although often assumed to be small, the actual effect of these assumptions has not been previously quantified in the literature. RESULTS: Herein we describe an algorithm to simultaneously infer the set of n-1 relative rates that parameterize the likelihood of an n-site alignment. Unlike previous work (a) these relative rates are completely identifiable and distinct from the branch-length parameters, and (b) a far more general class of rate priors can be used, and their effects quantified. Although described in a Bayesian framework, we discuss a future maximum likelihood extension. CONCLUSIONS: Using both synthetic data and alignments from the Myc, Max and p53 protein families, we find that inferring relative rather than absolute rates has several advantages. First, both empirical likelihoods and Bayes factors show strong preference for the relative-rate model, with a mean Delta ln P=-0.458 per alignment site. Second, the computed likelihoods and Bayes factors were essentially independent of the relative-rate prior, indicating that good estimates of the posterior rate distribution are not required a priori. Third, a novel finding is that rates can be accurately inferred even when up to approximately 4 substitutions per site have occurred. Thus biologically relevant putative hypervariable sites can be identified as easily as conserved sites. 
Lastly, our model treats rates and tree branch-lengths as completely identifiable, allowing for the first time coherent simultaneous inference of branch-lengths and site-specific evolutionary rates. AVAILABILITY: Source code for the utility described is available under a BSD-style license at http://www.fernandes.org/txp/article/9/site-specific-relative-evolutionary-rates.


Subject(s)
Algorithms , Evolution, Molecular , Proteins/chemistry , Animals , Basic-Leucine Zipper Transcription Factors/chemistry , Basic-Leucine Zipper Transcription Factors/genetics , Computational Biology , Databases, Protein , Genes, myc , Humans , Proteins/genetics , Sequence Alignment , Tumor Suppressor Protein p53/chemistry , Tumor Suppressor Protein p53/genetics
15.
J Mol Evol ; 67(1): 51-67, 2008 Jul.
Article in English | MEDLINE | ID: mdl-18560747

ABSTRACT

The tumor suppressor p53 is mutated in approximately 50% of all human cancer cases worldwide. It is commonly assumed that the phylogenetic history of this important tumor suppressor has been thoroughly studied; however, few detailed studies of the entire extended p53 protein family have been reported, and none comprehensively and simultaneously consider functional, molecular, and phylogenetic data. Herein we examine a diverse collection of reported p53-like protein sequences, including representatives from the arthropods, nematodes, and protists, with the goal of answering several important questions. First, what evidence supports these highly divergent proteins being true homologues to the p53 family? Second, is the inferred overall family phylogeny concordant with known structures and functions? Third, does the extended p53 family possess recognizable conserved sites outside of the within-chordate, highly-conserved DNA-binding domain? Our study shows that the biochemical and functional evidence of p53 homology for nematodes, arthropods, and protists is inconsistent with their implied phylogenetic relationship within the overall family. Although these divergent sequences are always reported as functionally similar to human p53, our results confirm and extend the hypothesis that p63 is a far more appropriate protein for comparison. Within these divergent sequences, we find minimal conservation within the DNA-binding domain, and no conservation elsewhere. Taken together, our findings suggest that these sequences are not bona fide homologues of the extended p53 family and provide baseline criteria for the future identification and characterization of distant p53-family homologues.


Subject(s)
Phylogeny , Tumor Suppressor Protein p53/classification , Animals , Arthropods/genetics , DNA-Binding Proteins/classification , Evolution, Molecular , Genes, p53 , Mollusca/genetics , Nematoda/genetics , Nuclear Proteins/classification , Sequence Homology, Amino Acid , Tumor Protein p73 , Tumor Suppressor Protein p53/chemistry , Tumor Suppressor Proteins/classification , Urochordata/genetics
16.
Evol Bioinform Online ; 2: 251-9, 2007 Feb 15.
Article in English | MEDLINE | ID: mdl-19455218

ABSTRACT

We present computational methods and subroutines to compute Gaussian quadrature integration formulas for arbitrary positive measures. For expensive integrands that can be factored into well-known forms, Gaussian quadrature schemes allow for efficient evaluation of high-accuracy and -precision numerical integrals, especially compared to general ad hoc schemes. In addition, for certain well-known density measures (the normal, gamma, log-normal, Student's t, inverse-gamma, beta, and Fisher's F) we present exact formulae for computing the respective quadrature scheme.
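
As a concrete instance of a Gaussian quadrature scheme, the sketch below applies the textbook 3-point Gauss-Legendre rule, which corresponds to the uniform weight on [-1, 1] and is exact for polynomials up to degree 5. It is not the paper's code for arbitrary measures; the function name `gauss_legendre_3` is an assumption for illustration.

```python
import math

# Nodes and weights of the 3-point Gauss-Legendre rule on [-1, 1].
NODES = (-math.sqrt(3 / 5), 0.0, math.sqrt(3 / 5))
WEIGHTS = (5 / 9, 8 / 9, 5 / 9)

def gauss_legendre_3(f, a=-1.0, b=1.0):
    """Approximate the integral of f over [a, b] using three evaluations of f."""
    half = (b - a) / 2
    mid = (a + b) / 2
    # Map each reference node from [-1, 1] onto [a, b] and rescale the weights.
    return half * sum(w * f(mid + half * x) for x, w in zip(NODES, WEIGHTS))

# The integral of x**4 over [-1, 1] is 2/5; the rule reproduces it
# up to floating-point error with only three function evaluations.
approx = gauss_legendre_3(lambda x: x ** 4)
```

Keeping the number of evaluations this small is the point when the integrand is expensive, as the abstract notes.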

17.
Proc Natl Acad Sci U S A ; 102(18): 6395-400, 2005 May 03.
Article in English | MEDLINE | ID: mdl-15851683

ABSTRACT

Biological sequences are composed of long strings of alphabetic letters rather than arrays of numerical values. Lack of a natural underlying metric for comparing such alphabetic data significantly inhibits sophisticated statistical analyses of sequences, modeling structural and functional aspects of proteins, and related problems. Herein, we use multivariate statistical analyses on almost 500 amino acid attributes to produce a small set of highly interpretable numeric patterns of amino acid variability. These high-dimensional attribute data are summarized by five multidimensional patterns of attribute covariation that reflect polarity, secondary structure, molecular volume, codon diversity, and electrostatic charge. Numerical scores for each amino acid then transform amino acid sequences for statistical analyses. Relationships between transformed data and amino acid substitution matrices show significant associations for polarity and codon diversity scores. Transformed alphabetic data are used in analysis of variance and discriminant analysis to study DNA binding in the basic helix-loop-helix proteins. The transformed scores offer a general solution for analyzing a wide variety of sequence analysis problems.


Subject(s)
Amino Acid Sequence/genetics , Computational Biology/methods , Genetic Variation , Models, Genetic , Phylogeny , Statistics as Topic/methods , Analysis of Variance , Cluster Analysis , Codon/genetics , Discriminant Analysis , Multivariate Analysis , Protein Conformation , Static Electricity
18.
Proc Natl Acad Sci U S A ; 102(18): 6401-6, 2005 May 03.
Article in English | MEDLINE | ID: mdl-15851686

ABSTRACT

Accurate identification of specific groups of proteins by their amino acid sequence is an important goal in genome research. Here we combine information theory with fuzzy logic search procedures to identify sequence signatures or predictive motifs for members of the Myc-Max-Mad transcription factor network. Myc is a well known oncoprotein, and this family is involved in cell proliferation, apoptosis, and differentiation. We describe a small set of amino acid sites from the N-terminal portion of the basic helix-loop-helix (bHLH) domain that provide very accurate sequence signatures for the Myc-Max-Mad transcription factor network and three of its member proteins. A predictive motif involving 28 contiguous bHLH sequence elements found 337 network proteins in the GenBank NR database with no mismatches or misidentifications. This motif also identifies at least one previously unknown fungal protein with strong affinity to the Myc-Max-Mad network. Another motif found 96% of known Myc protein sequences with only a single mismatch, including sequences from genomes previously not thought to contain Myc proteins. The predictive motif for Myc is very similar to the ancestral sequence for the Myc group estimated from phylogenetic analyses. Based on available crystal structure studies, this motif is discussed in terms of its functional consequences. Our results provide insight into evolutionary diversification of DNA binding and dimerization in a well characterized family of regulatory proteins and provide a method of identifying signature motifs in protein families.


Subject(s)
Amino Acid Sequence/genetics , DNA-Binding Proteins/genetics , Genetic Variation , Helix-Loop-Helix Motifs/genetics , Proto-Oncogene Proteins c-myc/genetics , Repressor Proteins/genetics , Transcription Factors/genetics , Basic-Leucine Zipper Transcription Factors , Computational Biology/methods , Fuzzy Logic , Genomics/methods , Molecular Sequence Data , Protein Conformation , Sequence Alignment , Sequence Homology