ABSTRACT
This paper reports two studies that model the inter-relationships between protein sequence, structure and function. First, an automated pipeline that provides a structural annotation of proteomes in the major genomes is described. The results are stored in a database at Imperial College, London (3D-GENOMICS) that can be accessed at www.sbg.bio.ic.ac.uk. Analysis of the assignments to structural superfamilies provides evolutionary insights. 3D-GENOMICS is being integrated with related proteome annotation data at University College London and the European Bioinformatics Institute in a project known as e-protein (http://www.e-protein.org/). The second topic is motivated by developments in structural genomics projects, in which the structure of a protein is determined prior to knowledge of its function. We have developed a new approach, PHUNCTIONER, that uses the gene ontology (GO) classification to supervise the extraction of the sequence signal responsible for protein function from a structure-based sequence alignment. Using GO, we can obtain profiles for a range of specificities described in the ontology. In the region of low sequence similarity (around 15%), our method is more accurate than assignment from the closest structural homologue. The method is also able to identify the specific residues associated with the function of the protein family.
Subject(s)
Evolution, Molecular , Proteome/chemistry , Proteome/metabolism , Computational Biology , Multigene Family , Protein Structure, Tertiary , Proteome/genetics , Sequence Homology
ABSTRACT
We present a systematic study of the clustering of genes within the human genome based on homology inferred from both sequence and structural similarity. The 3D-Genomics automated proteome annotation pipeline was utilised to infer homology for each protein domain in the genome, for the 26 superfamilies most highly represented in the Structural Classification Of Proteins (SCOP) database. This approach enabled us to identify homologues that could not be detected by sequence-based methods alone. For each superfamily, we investigated the distribution, both within and among chromosomes, of genes encoding at least one domain within the superfamily. The results indicate a diversity of clustering behaviours: some superfamilies showed no evidence of any clustering, and others displayed significant clustering either within or among chromosomes, or both. Removal of tandem repeats reduced the levels of clustering observed, but some superfamilies still displayed highly significant clustering. Thus, our study suggests that either the process of gene duplication, or the evolution of the resulting clusters, differs between structural superfamilies.
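As an illustration only, within-chromosome clustering of a superfamily's genes could be assessed with a permutation test along the following lines. This is a minimal sketch, not the method used in the study: the statistic (pairs of member genes within a fixed number of gene-order steps), the `window` parameter, and the function names are all assumptions introduced here.

```python
import random

def count_close_pairs(positions, window):
    """Count pairs of genes on the same chromosome whose gene-order
    indices differ by at most `window` (hypothetical statistic)."""
    by_chrom = {}
    for chrom, idx in positions:
        by_chrom.setdefault(chrom, []).append(idx)
    pairs = 0
    for idxs in by_chrom.values():
        idxs.sort()
        for i in range(len(idxs)):
            j = i + 1
            while j < len(idxs) and idxs[j] - idxs[i] <= window:
                pairs += 1
                j += 1
    return pairs

def clustering_p_value(all_genes, members, window=5, n_perm=1000, seed=0):
    """Compare the observed statistic for `members` against random
    gene sets of the same size drawn from `all_genes`."""
    rng = random.Random(seed)
    observed = count_close_pairs(members, window)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(all_genes, len(members))
        if count_close_pairs(sample, window) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

# Toy genome: three chromosomes with 200 genes each (made-up layout).
all_genes = [(c, i) for c in ("chr1", "chr2", "chr3") for i in range(200)]
clustered = [("chr1", i) for i in range(10)]
dispersed = [(c, i * 20) for c in ("chr1", "chr2", "chr3") for i in range(3)]
obs_clustered, p_clustered = clustering_p_value(all_genes, clustered)
obs_dispersed, p_dispersed = clustering_p_value(all_genes, dispersed)
```

In this toy layout the tightly packed gene set yields a small p-value and the evenly spaced set does not. Removing tandem repeats before testing, as described in the abstract, would correspond to filtering `members` before calling the function.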
Subject(s)
Genome, Human , Multigene Family/genetics , Protein Structure, Tertiary/genetics , Cadherins/classification , Cadherins/genetics , Chromosomes, Human/genetics , Computational Biology , Fibronectins/classification , Fibronectins/genetics , Gene Duplication , Genomics , Homeodomain Proteins/classification , Homeodomain Proteins/genetics , Humans , Sequence Homology , Tandem Repeat Sequences/genetics
ABSTRACT
The 3D-GENOMICS database (http://www.sbg.bio.ic.ac.uk/3dgenomics/) provides structural annotations for proteins from sequenced genomes. In August 2003 the database included data for 93 proteomes. The annotations stored in the database include homologous sequences from various sequence databases, domains from SCOP and Pfam, patterns from Prosite and other predicted sequence features such as transmembrane regions and coiled coils. In addition to annotations at the sequence level, several precomputed cross-proteome comparative analyses are available based on SCOP domain superfamily composition. Annotations are available to the user via a web interface to the database. Multiple points of entry are available so that a user is able to: (i) directly access annotations for a single protein sequence via keywords or accession codes, (ii) examine a sequence of interest chosen from a summary of annotations for a particular proteome, or (iii) access precomputed frequency-based cross-proteome comparative analyses.
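A frequency-based cross-proteome comparison of superfamily composition can be pictured roughly as follows. This is a hedged sketch rather than the database's actual implementation: the input layout (a mapping from protein identifier to the set of superfamilies assigned to it), the function names, and the toy identifiers are all hypothetical.

```python
from collections import Counter

def superfamily_frequencies(annotations):
    """Fraction of proteins in one proteome that contain each superfamily.
    `annotations` maps protein id -> set of superfamily ids (assumed layout)."""
    counts = Counter()
    for superfamilies in annotations.values():
        counts.update(set(superfamilies))  # count each protein once per superfamily
    total = len(annotations)
    return {sf: n / total for sf, n in counts.items()} if total else {}

def compare_proteomes(proteomes):
    """Build a superfamily -> {proteome: frequency} table across proteomes."""
    freqs = {name: superfamily_frequencies(a) for name, a in proteomes.items()}
    all_sf = sorted({sf for f in freqs.values() for sf in f})
    return {sf: {name: freqs[name].get(sf, 0.0) for name in proteomes}
            for sf in all_sf}

# Toy input with placeholder identifiers (not real annotations).
proteomes = {
    "A": {"p1": {"sf1", "sf2"}, "p2": {"sf1"}},
    "B": {"q1": {"sf2"}},
}
table = compare_proteomes(proteomes)
```

Counting each protein once per superfamily is one possible design choice; counting every domain occurrence instead would give domain-level rather than protein-level frequencies.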
Subject(s)
Databases, Protein , Genomics , Proteins/chemistry , Proteins/metabolism , Proteomics , Amino Acid Sequence , Animals , Computational Biology , Genome , Humans , Information Storage and Retrieval , Internet , Molecular Sequence Data , Protein Conformation , Proteins/genetics , Proteome , Sequence Alignment , Sequence Homology, Amino Acid , User-Computer Interface
ABSTRACT
Eukaryotic initiation factor 4B (eIF4B) is a multidomain protein with a range of activities that serves primarily to promote association of messenger RNA with the 40S ribosomal subunit during translation initiation. We report here the solution structure of the eIF4B RNA recognition motif (RRM) domain. It adopts a classical RRM fold, with a βαββαβ topology. The most striking difference from other RRM structures is in the disposition of loop 3, which connects the β2 and β3 strands and is implicated in RNA recognition. This loop folds down against the body of the RRM and exhibits restricted motion on a milli- to microsecond time scale. Although it contributes to a large basic patch on the RNA binding surface, it does not protrude from the domain as observed in other RRM structures, possibly implying a different mode of RNA binding. On its own, the core RRM domain provides only a relatively weak interaction with RNA targets and appears to require extensions at the N- and C-termini for high-affinity binding.