Results 1 - 11 of 11
1.
PLoS One ; 7(10): e46688, 2012.
Article in English | MEDLINE | ID: mdl-23056405

ABSTRACT

As next-generation sequencing projects generate massive genome-wide sequence variation data, bioinformatics tools are being developed to provide computational predictions on the functional effects of sequence variations and narrow down the search for causal variants for disease phenotypes. Different classes of sequence variations at the nucleotide level are involved in human diseases, including substitutions, insertions, deletions, frameshifts, and non-sense mutations. Frameshifts and non-sense mutations are likely to cause a negative effect on protein function. Existing prediction tools primarily focus on studying the deleterious effects of single amino acid substitutions through examining amino acid conservation at the position of interest among related sequences, an approach that is not directly applicable to insertions or deletions. Here, we introduce a versatile alignment-based score as a new metric to predict the damaging effects of variations not limited to single amino acid substitutions but also in-frame insertions, deletions, and multiple amino acid substitutions. This alignment-based score measures the change in sequence similarity of a query sequence to a protein sequence homolog before and after the introduction of an amino acid variation to the query sequence. Our results showed that the scoring scheme performs well in separating disease-associated variants (n = 21,662) from common polymorphisms (n = 37,022) for UniProt human protein variations, and also in separating deleterious variants (n = 15,179) from neutral variants (n = 17,891) for UniProt non-human protein variations. In our approach, the area under the receiver operating characteristic curve (AUC) for the human and non-human protein variation datasets is ∼0.85. We also observed that the alignment-based score correlates with the deleteriousness of a sequence variation. 
In summary, we have developed a new algorithm, PROVEAN (Protein Variation Effect Analyzer), which provides a generalized approach to predict the functional effects of protein sequence variations including single or multiple amino acid substitutions, and in-frame insertions and deletions. The PROVEAN tool is available online at http://provean.jcvi.org.
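
The before-and-after comparison described in this abstract can be illustrated with a minimal sketch. It is not the published PROVEAN implementation: the function names, the toy sequences, and the match/mismatch/gap parameters are assumptions chosen for illustration, and a plain Needleman-Wunsch global alignment stands in for whatever similarity measure the tool actually uses.

```python
def global_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    The scoring parameters are illustrative assumptions, not published values.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def delta_score(query, variant, homolog):
    """Similarity of the variant to the homolog minus that of the query."""
    return (global_alignment_score(variant, homolog)
            - global_alignment_score(query, homolog))


# Hypothetical toy sequences; the homolog is identical to the query here.
query = "MKTAYIAKQR"
homolog = "MKTAYIAKQR"
substitution = "MKTAYWAKQR"  # single substitution at position 6
deletion = "MKTAYAKQR"       # in-frame deletion of position 6

print(delta_score(query, substitution, homolog))  # -3
print(delta_score(query, deletion, homolog))      # -4
```

Because the score is defined on whole-sequence alignments rather than on a single column, the same function handles substitutions and in-frame insertions or deletions; a more negative value indicates a larger loss of similarity to the homolog.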


Subject(s)
Amino Acid Substitution/genetics , Computational Biology/methods , INDEL Mutation/genetics , Animals , Databases, Genetic , Genome, Human/genetics , Humans , Mutation
2.
J Gen Virol ; 93(Pt 11): 2387-2398, 2012 Nov.
Article in English | MEDLINE | ID: mdl-22837419

ABSTRACT

This study compared the complete genome sequences of 16 NL63 strain human coronaviruses (hCoVs) from respiratory specimens of paediatric patients with respiratory disease in Colorado, USA, and characterized the epidemiology and clinical characteristics associated with circulating NL63 viruses over a 3-year period. From 1 January 2009 to 31 December 2011, 92 of 9380 respiratory specimens were found to be positive for NL63 RNA by PCR, an overall prevalence of 1%. NL63 viruses were circulating during all 3 years, but there was considerable yearly variation in prevalence and the month of peak incidence. Phylogenetic analysis comparing the genome sequences of the 16 Colorado NL63 viruses with those of the prototypical hCoV-NL63 and three other NL63 viruses from the Netherlands demonstrated that there were three genotypes (A, B and C) circulating in Colorado from 2005 to 2010, and evidence of recombination between virus strains was found. Genotypes B and C co-circulated in Colorado in 2005, 2009 and 2010, but genotype A circulated only in 2005 when it was the predominant NL63 strain. Genotype C represents a new lineage that has not been described previously. The greatest variability in the NL63 virus genomes was found in the N-terminal domain (NTD) of the spike gene (nt 1-600, aa 1-200). Ten different amino acid sequences were found in the NTD of the spike protein among these NL63 strains and the 75 partial published sequences of NTDs from strains found at different times throughout the world.


Subject(s)
Coronavirus NL63, Human/genetics , Genetic Variation , Genotype , Membrane Glycoproteins/genetics , Recombination, Genetic , Viral Envelope Proteins/genetics , Adolescent , Child , Child, Preschool , Colorado/epidemiology , Coronavirus Infections/epidemiology , Coronavirus Infections/virology , Female , Genome, Viral , Humans , Infant , Infant, Newborn , Male , Molecular Sequence Data , Phylogeny , Protein Structure, Tertiary , Spike Glycoprotein, Coronavirus , Time Factors
3.
Proc Natl Acad Sci U S A ; 108(20): 8329-34, 2011 May 17.
Article in English | MEDLINE | ID: mdl-21536867

ABSTRACT

A whole-genome phylogeny of the Escherichia coli/Shigella group was constructed by using the feature frequency profile (FFP) method. This alignment-free approach uses the frequencies of l-mer features of whole genomes to infer phylogenic distances. We present two phylogenies that accentuate different aspects of E. coli/Shigella genomic evolution: (i) one based on the compositions of all possible features of length l = 24 (∼8.4 million features), which are likely to reveal the phenetic grouping and relationship among the organisms and (ii) the other based on the compositions of core features with low frequency and low variability (∼0.56 million features), which account for ∼69% of all commonly shared features among 38 taxa examined and are likely to have genome-wide lineal evolutionary signal. Shigella appears as a single clade when all possible features are used without filtering of noncore features. However, results using core features show that Shigella consists of at least two distantly related subclades, implying that the subclades evolved into a single clade because of a high degree of convergence influenced by mobile genetic elements and niche adaptation. In both FFP trees, the basal group of the E. coli/Shigella phylogeny is the B2 phylogroup, which contains primarily uropathogenic strains, suggesting that the E. coli/Shigella ancestor was likely a facultative or opportunistic pathogen. The extant commensal strains diverged relatively late and appear to be the result of reductive evolution of genomes. We also identify clade distinguishing features and their associated genomic regions within each phylogroup. Such features may provide useful information for understanding evolution of the groups and for quick diagnostic identification of each phylogroup.


Subject(s)
Escherichia coli/genetics , Genome, Bacterial , Models, Genetic , Phylogeny , Shigella/genetics , Biological Evolution
4.
Proc Natl Acad Sci U S A ; 107(1): 133-8, 2010 Jan 05.
Article in English | MEDLINE | ID: mdl-20018669

ABSTRACT

We present a whole-proteome phylogeny of prokaryotes constructed by comparing feature frequency profiles (FFPs) of whole proteomes. Features are l-mers of amino acids, and each organism is represented by a profile of frequencies of all features. The selection of feature length is critical in the FFP method, and we have developed a procedure for identifying the optimal feature lengths for inferring the phylogeny of prokaryotes, strictly speaking, a proteome phylogeny. Our FFP trees are constructed with whole proteomes of 884 prokaryotes, 16 unicellular eukaryotes, and 2 random sequences. To highlight the branching order of major groups, we present a simplified proteome FFP tree of monophyletic class or phylum with branch support. In our whole-proteome FFP trees (i) Archaea, Bacteria, Eukaryota, and a random sequence outgroup are clearly separated; (ii) Archaea and Bacteria form a sister group when rooted with random sequences; (iii) Planctomycetes, which possesses an intracellular membrane compartment, is placed at the basal position of the Bacteria domain; (iv) almost all groups are monophyletic in prokaryotes at most taxonomic levels, but many differences in the branching order of major groups are observed between our proteome FFP tree and trees built with other methods; and (v) previously "unclassified" genomes may be assigned to the most likely taxa. We describe notable similarities and differences between our FFP trees and those based on other methods in grouping and phylogeny of prokaryotes.


Subject(s)
Phylogeny , Prokaryotic Cells , Proteome/genetics , Proteomics/methods , Sequence Analysis, Protein/methods , Genome , Prokaryotic Cells/classification , Prokaryotic Cells/physiology , Sequence Alignment/methods
5.
Proc Natl Acad Sci U S A ; 106(40): 17077-82, 2009 Oct 06.
Article in English | MEDLINE | ID: mdl-19805074

ABSTRACT

Ten complete mammalian genome sequences were compared by using the "feature frequency profile" (FFP) method of alignment-free comparison. This comparison technique reveals that the whole nongenic portion of mammalian genomes contains evolutionary information that is similar to their genic counterparts, the intron and exon regions. We partitioned the complete genomes of mammals (such as human, chimp, horse, and mouse) into their constituent nongenic, intronic, and exonic components. Phylogenic species trees were constructed for each individual component class of genome sequence data as well as the whole genomes by using standard tree-building algorithms with FFP distances. The phylogenies of the whole genomes and each of the component classes (exonic, intronic, and nongenic regions) have similar topologies, within the optimal feature length range, and all agree well with the evolutionary phylogeny based on a recent large dataset, multispecies, and multigene-based alignment. In the strictest sense, the FFP-based trees are genome phylogenies, not species phylogenies. However, the species phylogeny is highly related to the whole-genome phylogeny. Furthermore, our results reveal that the footprints of evolutionary history are spread throughout the entire length of the whole genome of an organism and are not limited to genes, introns, or short, highly conserved, nongenic sequences that can be adversely affected by factors (such as a choice of sequences, homoplasy, and different mutation rates) resulting in inconsistent species phylogenies.


Subject(s)
Evolution, Molecular , Genome/genetics , Phylogeny , Animals , Computational Biology/methods , Exons , Genomics/methods , Humans , Introns , Mammals/classification , Mammals/genetics
6.
Proc Natl Acad Sci U S A ; 106(31): 12826-31, 2009 Aug 04.
Article in English | MEDLINE | ID: mdl-19553209

ABSTRACT

The vast sequence divergence among different virus groups has presented a great challenge to alignment-based sequence comparison among different virus families. Using an alignment-free comparison method, we construct the whole-proteome phylogeny for a population of viruses from 11 viral families comprising 142 large dsDNA eukaryote viruses. The method is based on the feature frequency profiles (FFP), where the length of the feature (l-mer) is selected to be optimal for phylogenomic inference. We observe that (i) the FFP phylogeny segregates the population into clades, the membership of each has remarkable agreement with current classification by the International Committee on the Taxonomy of Viruses, with one exception that the mimivirus joins the phycodnavirus family; (ii) the FFP tree detects potential evolutionary relationships among some viral families; (iii) the relative position of the 3 herpesvirus subfamilies in the FFP tree differs from gene alignment-based analysis; (iv) the FFP tree suggests the taxonomic positions of certain "unclassified" viruses; and (v) the FFP method identifies candidates for horizontal gene transfer between virus families.


Subject(s)
DNA Viruses/classification , Phylogeny , Proteome , Baculoviridae/classification , DNA Viruses/genetics , Gene Transfer, Horizontal , Herpesviridae/classification , Phycodnaviridae/classification , Poxviridae/classification , Sequence Alignment
7.
Proc Natl Acad Sci U S A ; 106(8): 2677-82, 2009 Feb 24.
Article in English | MEDLINE | ID: mdl-19188606

ABSTRACT

For comparison of whole-genome (genic + nongenic) sequences, multiple sequence alignment of a few selected genes is not appropriate. One approach is to use an alignment-free method in which feature (or l-mer) frequency profiles (FFP) of whole genomes are used for comparison, a variation of a text or book comparison method that uses word frequency profiles. In this approach it is critical to identify the optimal resolution range of l-mers for the given set of genomes compared. The optimum FFP method is applicable for comparing whole genomes or large genomic regions even when there are no common genes with high homology. We outline the method in 3 stages: (i) We first show how the optimal resolution range can be determined with English books which have been transformed into long character strings by removing all punctuation and spaces. (ii) Next, we test the robustness of the optimized FFP method at the nucleotide level, using a mutation model with a wide range of base substitutions and rearrangements. (iii) Finally, to illustrate the utility of the method, phylogenies are reconstructed from concatenated mammalian intronic genomes; the FFP-derived intronic genome topologies for each l within the optimal range are all very similar. The topology agrees with the established mammalian phylogeny, revealing that intron regions contain a similar level of phylogenic signal as do coding regions.
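
The text-comparison stage (i) of this abstract can be sketched in a few lines. This is a minimal illustration only: the function names are hypothetical, the feature length l = 3 is arbitrary rather than an optimized value, and the use of the Jensen-Shannon divergence as the distance between two profiles is an assumption, since the abstract does not name a distance measure.

```python
import math
import string
from collections import Counter


def to_feature_string(text):
    """Lowercase the text and drop everything except the letters a-z."""
    return "".join(ch for ch in text.lower() if ch in string.ascii_lowercase)


def ffp(sequence, l):
    """Return the normalized frequency profile of overlapping l-mers."""
    if len(sequence) < l:
        raise ValueError("sequence is shorter than the feature length")
    counts = Counter(sequence[i:i + l] for i in range(len(sequence) - l + 1))
    total = sum(counts.values())
    return {feature: n / total for feature, n in counts.items()}


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2) between two sparse profiles.

    Using this divergence as the FFP distance is an assumption of this sketch.
    """
    divergence = 0.0
    for feature in set(p) | set(q):
        pi, qi = p.get(feature, 0.0), q.get(feature, 0.0)
        mi = (pi + qi) / 2
        if pi:
            divergence += 0.5 * pi * math.log2(pi / mi)
        if qi:
            divergence += 0.5 * qi * math.log2(qi / mi)
    return divergence


# Hypothetical example texts; l = 3 is arbitrary, not an optimized length.
text_a = to_feature_string("It was the best of times, it was the worst of times.")
text_b = to_feature_string("It was the age of wisdom, it was the age of foolishness.")
print(jensen_shannon(ffp(text_a, 3), ffp(text_b, 3)))
```

The same functions apply unchanged to nucleotide or amino acid strings. A matrix of pairwise divergences could then be passed to a standard tree-building algorithm, which is outside the scope of this sketch.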


Subject(s)
Genome , Introns , Phylogeny
8.
Proc Natl Acad Sci U S A ; 103(12): 4428-32, 2006 Mar 21.
Article in English | MEDLINE | ID: mdl-16537409

ABSTRACT

A method is presented for scoring the model quality of experimental and theoretical protein structures. The structural model to be evaluated is dissected into small fragments via a sliding window, where each fragment is represented by a vector of multiple phi-psi angles. The sliding window ranges in size from a length of 1-10 phi-psi pairs (3-12 residues). In this method, the conformation of each fragment is scored based on the fit of multiple phi-psi angles of the fragment to a database of multiple phi-psi angles from high-resolution x-ray crystal structures. We show that measuring the fit of predicted structural models to the allowed conformational space of longer fragments is a significant discriminator for model quality. Reasonable models have higher-order phi-psi score fit values (m) > -1.00.
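
The sliding-window dissection described in this abstract can be sketched as follows. Only the fragment extraction is shown; the scoring of each fragment against a database of high-resolution structures is not reproduced. The function name and the example angles are hypothetical.

```python
def phi_psi_fragments(angles, window):
    """Return one flattened (phi, psi, phi, psi, ...) vector per window position.

    `angles` is a list of (phi, psi) pairs in degrees, one per residue that
    has both angles defined; `window` is the number of consecutive pairs.
    """
    if not 1 <= window <= len(angles):
        raise ValueError("window must be between 1 and len(angles)")
    return [
        tuple(value for pair in angles[start:start + window] for value in pair)
        for start in range(len(angles) - window + 1)
    ]


# Hypothetical backbone angles for a five-residue stretch.
angles = [(-60.0, -45.0), (-65.0, -40.0), (-120.0, 130.0),
          (-75.0, 150.0), (60.0, 45.0)]

# The abstract uses window sizes of 1 to 10 pairs; this toy chain stops at 5.
for window in range(1, min(10, len(angles)) + 1):
    fragments = phi_psi_fragments(angles, window)
    print(window, len(fragments), len(fragments[0]))
```

A chain with n phi-psi pairs yields n - window + 1 fragments for a given window size, each a vector of 2 × window angles.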


Subject(s)
Models, Molecular , Protein Conformation , Proteins/chemistry , Research Design
9.
Proc Natl Acad Sci U S A ; 102(3): 618-21, 2005 Jan 18.
Article in English | MEDLINE | ID: mdl-15640351

ABSTRACT

We have mapped protein conformational space from two to seven residue lengths by employing multidimensional scaling on a data matrix composed of pair-wise angular distances for multiple phi-psi values collected from high-resolution protein structures. The resulting global maps show clustering of peptide conformations that reveals a dramatic reduction of conformational space as sampled by experimentally observed peptides. Each map can be viewed as a higher order phi-psi plot defining regions of space that are conformationally allowed.
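
The data matrix of pair-wise angular distances mentioned in this abstract can be sketched as follows. The abstract does not define the distance, so the definition used here (the Euclidean norm of per-angle differences, each wrapped to the range 0 to 180 degrees) is an assumption, as are the function names and example fragments.

```python
import math


def wrapped_difference(a, b):
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def angular_distance(u, v):
    """Euclidean norm of the wrapped per-angle differences of two fragments.

    This particular definition is an assumption of the sketch.
    """
    return math.sqrt(sum(wrapped_difference(a, b) ** 2 for a, b in zip(u, v)))


def distance_matrix(fragments):
    """Symmetric matrix of angular distances between all fragment pairs."""
    return [[angular_distance(u, v) for v in fragments] for u in fragments]


# Hypothetical two-pair fragments, flattened as (phi, psi, phi, psi).
fragments = [
    (-60.0, -45.0, -65.0, -40.0),
    (-120.0, 130.0, -75.0, 150.0),
    (179.0, -179.0, -60.0, -45.0),
]
for row in distance_matrix(fragments):
    print([round(d, 1) for d in row])
```

Wrapping matters because torsion angles are periodic: 179 and -179 degrees differ by 2 degrees, not 358. The resulting matrix is the kind of input that a multidimensional scaling routine would then embed in a low-dimensional map.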


Subject(s)
Models, Molecular , Proteins/chemistry , Cluster Analysis , Peptides/chemistry , Protein Conformation
10.
Nucleic Acids Res ; 31(19): 5607-16, 2003 Oct 01.
Article in English | MEDLINE | ID: mdl-14500824

ABSTRACT

A global conformational space of 6253 dinucleoside monophosphate (DMP) units consisting of RNA and DNA (free and protein/drug-bound) was 'mapped' using high resolution crystal structures cataloged in the Nucleic Acid Database (NDB). The torsion angles of each DMP were clustered in a reduced three-dimensional space using a classical multi-dimensional scaling method. The mapping of the conformational space reveals nine primary clusters which distinguish among the common A-, B- and Z-forms and their various substates, plus five secondary clusters for kinked or bent structures. Conformational relationships and possible transitional pathways among the substates are also examined using the conformational states of DNA and RNA bound with proteins or drugs as potential pathway intermediates.


Subject(s)
DNA/chemistry , Dinucleoside Phosphates/chemistry , RNA/chemistry , Algorithms , DNA/drug effects , DNA/metabolism , DNA-Binding Proteins/metabolism , Models, Molecular , Molecular Structure , Nucleic Acid Conformation , Principal Component Analysis , RNA/drug effects , RNA/metabolism , RNA-Binding Proteins/metabolism
11.
Proc Natl Acad Sci U S A ; 100(5): 2386-90, 2003 Mar 04.
Article in English | MEDLINE | ID: mdl-12606708

ABSTRACT

One of the principal goals of the structural genomics initiative is to identify the total repertoire of protein folds and obtain a global view of the "protein structure universe." Here, we present a 3D map of the protein fold space in which structurally related folds are represented by spatially adjacent points. Such a representation reveals a high-level organization of the fold space that is intuitively interpretable. The shape of the fold space and the overall distribution of the folds are defined by three dominant trends: secondary structure class, chain topology, and protein domain size. Random coil-like structures of small proteins and peptides are mapped to a region where the three trends converge, offering an interesting perspective on both the demography of fold space and the evolution of protein structures.


Subject(s)
Bacterial Proteins/chemistry , Biophysics , Proteins/chemistry , Biophysical Phenomena , Models, Molecular , Phylogeny , Protein Folding , Protein Structure, Tertiary