1.
Nature ; 420(6915): 563-73, 2002 Dec 05.
Article in English | MEDLINE | ID: mdl-12466851

ABSTRACT

Only a small proportion of the mouse genome is transcribed into mature messenger RNA transcripts. There is an international collaborative effort to identify all full-length mRNA transcripts from the mouse, and to ensure that each is represented in a physical collection of clones. Here we report the manual annotation of 60,770 full-length mouse complementary DNA sequences. These are clustered into 33,409 'transcriptional units', contributing 90.1% of a newly established mouse transcriptome database. Of these transcriptional units, 4,258 are new protein-coding and 11,665 are new non-coding messages, indicating that non-coding RNA is a major component of the transcriptome. 41% of all transcriptional units showed evidence of alternative splicing. In protein-coding transcripts, 79% of splice variations altered the protein product. Whole-transcriptome analyses resulted in the identification of 2,431 sense-antisense pairs. The present work, completely supported by physical clones, provides the most comprehensive survey of a mammalian transcriptome so far, and is a valuable resource for functional genomics.


Subject(s)
DNA, Complementary/genetics , Genomics , Mice/genetics , Transcription, Genetic/genetics , Alternative Splicing/genetics , Amino Acid Motifs , Animals , Chromosomes, Mammalian/genetics , Cloning, Molecular , Databases, Genetic , Expressed Sequence Tags , Genes/genetics , Genomics/methods , Humans , Membrane Proteins/genetics , Physical Chromosome Mapping , Protein Structure, Tertiary , Proteome/chemistry , Proteome/genetics , RNA, Antisense/genetics , RNA, Messenger/analysis , RNA, Messenger/genetics , RNA, Untranslated/analysis , RNA, Untranslated/genetics , Transcription Initiation Site
2.
Trends Biotechnol ; 19(12): 482-6, 2001 Dec.
Article in English | MEDLINE | ID: mdl-11711174

ABSTRACT

Escherichia coli has been a popular organism for studying metabolic pathways. In an attempt to find out more about how these pathways are constructed, the enzymes were analysed by defining their protein domains. Structural assignments and sequence comparisons were used to show that 213 domain families constitute approximately 90% of the enzymes in the small-molecule metabolic pathways. Catalytic or cofactor-binding properties between family members are often conserved, while recognition of the main substrate with change in catalytic mechanism is only observed in a few cases of consecutive enzymes in a pathway. Recruitment of domains across pathways is very common, but there is little regularity in the pattern of domains in metabolic pathways. This is analogous to a mosaic in which a stone of a certain colour is selected to fill a position in the picture.


Subject(s)
Enzymes/chemistry , Enzymes/metabolism , Escherichia coli/enzymology , Binding Sites/physiology , Coenzymes/metabolism , Escherichia coli/metabolism , Evolution, Molecular , Fucose/metabolism , Nucleosides/metabolism , Nucleotides/metabolism , Protein Structure, Tertiary/physiology , Purines/biosynthesis , Pyrimidines/biosynthesis , Pyruvic Acid/metabolism , Sequence Homology , Substrate Specificity/physiology , Tryptophan/biosynthesis
3.
J Mol Biol ; 313(4): 903-19, 2001 Nov 02.
Article in English | MEDLINE | ID: mdl-11697912

ABSTRACT

Of the sequence comparison methods, profile-based methods perform with greater selectivity than those that use pairwise comparisons. Of the profile methods, hidden Markov models (HMMs) are apparently the best. The first part of this paper describes calculations that (i) improve the performance of HMMs and (ii) determine a good procedure for creating HMMs for sequences of proteins of known structure. For a family of related proteins, more homologues are detected using multiple models built from diverse single seed sequences than from one model built from a good alignment of those sequences. A new procedure is described for detecting and correcting those errors that arise at the model-building stage of the procedure. These two improvements greatly increase selectivity and coverage. The second part of the paper describes the construction of a library of HMMs, called SUPERFAMILY, that represent essentially all proteins of known structure. The sequences of the domains in proteins of known structure, that have identities less than 95 %, are used as seeds to build the models. Using the current data, this gives a library with 4894 models. The third part of the paper describes the use of the SUPERFAMILY model library to annotate the sequences of over 50 genomes. The models match twice as many target sequences as are matched by pairwise sequence comparison methods. For each genome, close to half of the sequences are matched in all or in part and, overall, the matches cover 35 % of eukaryotic genomes and 45 % of bacterial genomes. On average roughly 15 % of genome sequences are labelled as being hypothetical yet homologous to proteins of known structure. The annotations derived from these matches are available from a public web server at: http://stash.mrc-lmb.cam.ac.uk/SUPERFAMILY. This server also enables users to match their own sequences against the SUPERFAMILY model library.


Subject(s)
Computational Biology/methods , Genome , Markov Chains , Peptide Library , Proteins/chemistry , Proteins/genetics , Sequence Homology , Amino Acid Sequence , Animals , Computers , Databases, Protein , Genomics/methods , Humans , Internet , Models, Molecular , Molecular Sequence Data , Protein Structure, Tertiary , Reproducibility of Results , Sequence Alignment/methods
4.
Protein Sci ; 10(9): 1801-10, 2001 Sep.
Article in English | MEDLINE | ID: mdl-11514671

ABSTRACT

The sequence and structural analysis of cadherins allow us to find sequence determinants: a few positions in sequences whose residues are characteristic and specific for the structures of a given family. Comparison of the five extracellular domains of classic cadherins showed that they share the same sequence determinants despite only a nonsignificant sequence similarity between the N-terminal domain and other extracellular domains. This allowed us to predict secondary structures and propose three-dimensional structures for these domains that have not been structurally analyzed previously. A new method of assigning a sequence to its proper protein family is suggested: analysis of sequence determinants. The main advantage of this method is that it is not necessary to know all or almost all residues in a sequence as required for other traditional classification tools such as BLAST, FASTA, and HMM. Using the key positions only, that is, residues that serve as the sequence determinants, we found that all members of the classic cadherin family were unequivocally selected from among 80,000 examined proteins. In addition, we proposed a model for the secondary structure of the cytoplasmic domain of cadherins based on the principal relations between sequences and secondary structure multialignments. The patterns of the secondary structure of this domain can serve as the distinguishing characteristics of cadherins.


Subject(s)
Cadherins/chemistry , Computational Biology/methods , Algorithms , Amino Acid Sequence , Classification/methods , Databases as Topic , Molecular Sequence Data , Protein Structure, Secondary , Protein Structure, Tertiary , Sequence Alignment , Sequence Homology, Amino Acid , Structure-Activity Relationship
5.
J Mol Biol ; 311(4): 693-708, 2001 Aug 24.
Article in English | MEDLINE | ID: mdl-11518524

ABSTRACT

The 106 small molecule metabolic (SMM) pathways in Escherichia coli are formed by the protein products of 581 genes. We can define 722 domains, nearly all of which are homologous to proteins of known structure, that form all or part of 510 of these proteins. This information allows us to answer general questions on the structural anatomy of the SMM pathway proteins and to trace family relationships and recruitment events within and across pathways. Half the gene products contain a single domain and half are formed by combinations of between two and six domains. The 722 domains belong to one of 213 families that have between one and 51 members. Family members usually conserve their catalytic or cofactor binding properties; substrate recognition is rarely conserved. Of the 213 families, members of only a quarter occur in isolation, i.e. they form single-domain proteins. Most members of the other families combine with domains from just one or two other families and a few more versatile families can combine with several different partners. Excluding isoenzymes, more than twice as many homologues are distributed across pathways as within pathways. However, serial recruitment, with two consecutive enzymes both being recruited to another pathway, is rare and recruitment of three consecutive enzymes is not observed. Only eight of the 106 pathways have a high number of homologues. Homology between consecutive pairs of enzymes with conservation of the main substrate-binding site but change in catalytic mechanism (which would support a simple model of retrograde pathway evolution) occurs only six times in the whole set of enzymes. Most of the domains that form SMM pathways have homologues in non-SMM pathways. Taken together, these results imply a pervasive "mosaic" model for the formation of protein repertoires and pathways.


Subject(s)
Bacterial Proteins/chemistry , Bacterial Proteins/metabolism , Escherichia coli/chemistry , Escherichia coli/metabolism , Evolution, Molecular , Binding Sites , Conserved Sequence , Genes, Duplicate , Gluconeogenesis , Glycogen/metabolism , Histidine/biosynthesis , Markov Chains , Multigene Family , Nucleotides/metabolism , Phosphatidic Acids/biosynthesis , Polysaccharides/biosynthesis , Protein Structure, Tertiary , Proteome , Purines/biosynthesis , Pyrimidines/biosynthesis , Sequence Homology, Amino Acid
6.
Curr Opin Struct Biol ; 11(3): 354-63, 2001 Jun.
Article in English | MEDLINE | ID: mdl-11406387

ABSTRACT

The genome sequencing projects and knowledge of the entire protein repertoires of many organisms have prompted new procedures and techniques for the large-scale determination of protein structure, function and interactions. Recently, new work has been carried out on the determination of the function and evolutionary relationships of proteins by experimental structural genomics, and the discovery of protein-protein interactions by computational structural genomics.


Subject(s)
Evolution, Molecular , Genomics/methods , Proteins/physiology , Gene Order , Phylogeny , Protein Structure, Tertiary , Proteins/chemistry
7.
J Mol Biol ; 305(5): 1011-24, 2001 Feb 02.
Article in English | MEDLINE | ID: mdl-11162110

ABSTRACT

The ability to form selective cell-cell adhesions is an essential property of metazoan cells. Members of the cadherin superfamily are important regulators of this process in both vertebrates and invertebrates. With the advent of genome sequencing projects, determination of the full repertoire of cadherins available to an organism is possible and here we present the identification and analysis of the cadherin repertoires in the genomes of Caenorhabditis elegans and Drosophila melanogaster. Hidden Markov models of cadherin domains were matched to the protein sequences obtained from the translation of the predicted gene sequences. Matches were made to 21 C. elegans and 18 D. melanogaster sequences. Experimental and theoretical work on C. elegans sequences, and data from ESTs, show that three pairs of genes, and two triplets, should be merged to form five single genes. It also produced sequence changes at one or both of the 5' and 3' termini of half the sequences. In D. melanogaster it is probable that two of the cadherin genes should also be merged together and that three cadherin genes should be merged with other neighbouring genes. Of the 15 cadherin proteins found in C. elegans, 13 have the features of cell surface proteins, signal sequences and transmembrane helices; the other two have only signal sequences. Of the 17 in D. melanogaster, 11 at present have both features and another five have transmembrane helices. The evidence currently available suggests about one-third of the cadherins in the two organisms can be grouped into subfamilies in which all, or parts of, the molecules are conserved. Each organism also has an approximately 980-residue protein (CDH-11 and CG11059) with two cadherin domains and whose sequences match well over their entire length two proteins from human brain. Two proteins in C. elegans, HMR-1A and HMR-1B, and three in D. melanogaster, CadN, Shg and CG7527, have cytoplasmic domains homologous to those of the classical cadherin genes of chordates but their extracellular regions have different domain structures. Other common subclasses include the seven-helix membrane cadherins, Fat-like protocadherins and the Ret-like cadherins. At present, the remaining cadherins have no obvious similarities in their extracellular domain architecture or homologies to their cytoplasmic domains and may, therefore, represent species-specific or phylum-specific molecules.


Subject(s)
Cadherins/chemistry , Caenorhabditis elegans/chemistry , Drosophila melanogaster/chemistry , Amino Acid Sequence , Animals , Binding Sites , Cadherins/genetics , Cadherins/metabolism , Calcium/metabolism , Computational Biology/methods , Conserved Sequence , Epidermal Growth Factor/chemistry , Genomics , Laminin/chemistry , Models, Molecular , Molecular Sequence Data , Multigene Family , Protein Structure, Tertiary , Reverse Transcriptase Polymerase Chain Reaction , Sequence Alignment , Vertebrates
8.
Bioinformatics ; 16(2): 104-10, 2000 Feb.
Article in English | MEDLINE | ID: mdl-10842730

ABSTRACT

MOTIVATION: The Sequence Search Algorithm Assessment and Testing Toolkit (SAT) aims to be a complete package for the comparison of different protein homology search algorithms. The structural classification of proteins can provide us with a clear criterion for judgment in homology detection. There have been several assessments based on structural sequences with classifications but a good deal of similar work is now being repeated with locally developed procedures and programs. The SAT will provide developers with a complete package which will save time and produce more comparable performance assessments for search algorithms. The package is complete in the sense that it provides a non-redundant large sequence resource database, a well-characterized query database of protein domains, all the parsers and some previous results from PSI-BLAST and a hidden Markov model algorithm. RESULTS: An analysis on two different data sets was carried out using the SAT package. It compared the performance of a full protein sequence database (RSDB100) with a non-redundant representative sequence database derived from it (RSDB50). The performance measurement indicated that the full database is sub-optimal for a homology search. This result justifies the use of the much smaller and faster RSDB50 rather than RSDB100 for the SAT. AVAILABILITY: A web site is up. The whole package is accessible via www and ftp. ftp://ftp.ebi.ac.uk/pub/contrib/jong/SAT http://cyrah.ebi.ac.uk:1111/Proj/Bio/SAT http://www.mrc-lmb.cam.ac.uk/genomes/SAT In the package, some previous assessment results produced by the package can also be found for reference. CONTACT: jong@ebi.ac.uk


Subject(s)
Algorithms , Proteins/analysis , Databases, Factual
9.
Bioinformatics ; 16(2): 117-24, 2000 Feb.
Article in English | MEDLINE | ID: mdl-10842732

ABSTRACT

MOTIVATION: For large-scale structural assignment to sequences, as in computational structural genomics, a fast yet sensitive sequence search procedure is essential. A new approach using intermediate sequences was tested as a shortcut to iterative multiple sequence search methods such as PSI-BLAST. RESULTS: A library containing potential intermediate sequences for proteins of known structure (PDB-ISL) was constructed. The sequences in the library were collected from a large sequence database using the sequences of the domains of proteins of known structure as the query sequences and the program PSI-BLAST. Sequences of proteins of unknown structure can be matched to distantly related proteins of known structure by using pairwise sequence comparison methods to find homologues in PDB-ISL. Searches of PDB-ISL were calibrated, and the number of correct matches found at a given error rate was the same as that found by PSI-BLAST. The advantage of this library is that it uses pairwise sequence comparison methods, such as FASTA or BLAST2, and can, therefore, be searched easily and, in many cases, much more quickly than an iterative multiple sequence comparison method. The procedure is roughly 20 times faster than PSI-BLAST for small genomes and several hundred times for large genomes. AVAILABILITY: Sequences can be submitted to the PDB-ISL servers at http://stash.mrc-lmb.cam.ac.uk/PDB_ISL/ or http://cyrah.ebi.ac.uk:1111/Serv/PDB_ISL/ and can be downloaded from ftp://ftp.ebi.ac.uk/pub/contrib/jong/PDB_ISL/ CONTACT: sat@mrc-lmb.cam.ac.uk and jong@ebi.ac.uk


Subject(s)
Proteins/analysis , Sequence Analysis/methods , Peptide Library , Time Factors
10.
Bioinformatics ; 16(5): 458-64, 2000 May.
Article in English | MEDLINE | ID: mdl-10871268

ABSTRACT

MOTIVATION: Biological sequence databases are highly redundant for two main reasons: (1) various databanks keep redundant sequences, with many identical and nearly identical sequences; (2) natural sequences often have high sequence identities due to gene duplication. We wanted to know how many sequences can be removed before the databases start losing homology information. Can a database of sequences with mutual sequence identity of 50% or less provide us with the same amount of biological information as the original full database? RESULTS: Comparisons of nine representative sequence databases (RSDB) derived from full protein databanks showed that the information content of sequence databases is not linearly proportional to their size. An RSDB reduced to mutual sequence identity of around 50% (RSDB50) was equivalent to the original full database in terms of the effectiveness of homology searching. It was a third of the full database size, which resulted in six times faster iterative profile searching. The RSDBs are produced at different granularity for efficient homology searching. AVAILABILITY: All the RSDB files generated and the full analysis results are available through the internet: ftp://ftp.ebi.ac.uk/pub/contrib/jong/RSDB/ http://cyrah.ebi.ac.uk:1111/Proj/Bio/RSDB


Subject(s)
Databases, Factual , Proteins/genetics , Algorithms , Amino Acid Sequence , Gene Duplication , Sequence Alignment , Sequence Homology, Amino Acid
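
The idea of a representative database with bounded mutual identity, as described in the abstract above, can be sketched roughly as follows. This is a toy illustration and not the authors' procedure: the function names, the greedy longest-first strategy, the example sequences, and the assumption that sequences are already aligned to equal length are all hypothetical.

```python
def percent_identity(a, b):
    """Percent of identical, non-gap positions between two pre-aligned sequences.

    Assumption: both sequences were aligned beforehand and have equal length.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)


def representative_set(sequences, max_identity=50.0):
    """Greedily keep sequences whose identity to every kept one is <= max_identity."""
    # Visit longer (ungapped) sequences first; ties keep their input order.
    ordered = sorted(sequences, key=lambda name: -len(sequences[name].replace("-", "")))
    kept = []
    for name in ordered:
        if all(percent_identity(sequences[name], sequences[k]) <= max_identity for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy data: seqB differs from seqA at a single position.
toy = {"seqA": "MKTAYIAKQR", "seqB": "MKTAYIAKQL", "seqC": "GHWLPDNEVS"}
representatives = representative_set(toy, max_identity=50.0)
print(representatives)
```

Here the near-duplicate of `seqA` is dropped while the unrelated `seqC` is kept. A real reduction would compute identities from actual alignments over a full databank.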
11.
J Mol Biol ; 296(5): 1367-83, 2000 Mar 10.
Article in English | MEDLINE | ID: mdl-10698639

ABSTRACT

The predicted proteins of the genome of Caenorhabditis elegans were analysed by various sequence comparison methods to identify the repertoire of proteins that are members of the immunoglobulin superfamily (IgSF). The IgSF is one of the largest families of protein domain in this genome and likely to be one of the major families in other multicellular eukaryotes too. This is because members of the superfamily are involved in a variety of functions including cell-cell recognition, cell-surface receptors, muscle structure and, in higher organisms, the immune system. Sixty-four proteins with 488 I set IgSF domains were identified largely by using Hidden Markov models. The domain architectures of the protein products of these 64 genes are described. Twenty-one of these had been characterised previously. We show that another 25 are related to proteins of known function. The C. elegans IgSF proteins can be classified into five broad categories: muscle proteins, protein kinases and phosphatases, three categories of proteins involved in the development of the nervous system, leucine-rich repeat containing proteins and proteins without homologues of known function, of which there are 18. The 19 proteins involved in nervous system development that are not kinases or phosphatases are homologues of neuroglian, axonin, NCAM, wrapper, klingon, ICCR and nephrin or belong to the recently identified zig gene family. Out of the set of 64 genes, 22 are on the X chromosome. This study should be seen as an initial description of the IgSF repertoire in C. elegans, because the current gene definitions may contain a number of errors, especially in the case of long sequences, and there may be IgSF genes that have not yet been detected. However, the proteins described here do provide an overview of the bulk of the repertoire of immunoglobulin superfamily members in C. elegans, a framework for refinement and extension of the repertoire as gene and protein definitions improve, and the basis for investigations of their function and for comparisons with the repertoires of other organisms.


Subject(s)
Caenorhabditis elegans/chemistry , Computational Biology , Helminth Proteins/chemistry , Immunoglobulins/chemistry , Multigene Family , Sequence Homology , Animals , Caenorhabditis elegans/enzymology , Caenorhabditis elegans/genetics , Cell Adhesion Molecules, Neuronal/chemistry , Cell Adhesion Molecules, Neuronal/genetics , Genes, Helminth/genetics , Helminth Proteins/genetics , Humans , Immunoglobulins/genetics , Leucine/genetics , Leucine/metabolism , Markov Chains , Multigene Family/genetics , Muscle Proteins/chemistry , Muscle Proteins/genetics , Nerve Tissue Proteins/chemistry , Nerve Tissue Proteins/genetics , Physical Chromosome Mapping , Protein Structure, Tertiary , Protein Tyrosine Phosphatases/chemistry , Protein Tyrosine Phosphatases/genetics , Protein-Tyrosine Kinases/chemistry , Protein-Tyrosine Kinases/genetics , Sequence Alignment , X Chromosome/genetics
12.
J Mol Biol ; 295(4): 979-95, 2000 Jan 28.
Article in English | MEDLINE | ID: mdl-10656805

ABSTRACT

T cell alphabeta receptors have binding sites for peptide-MHC complexes formed by six hypervariable regions. Analysis of the six atomic structures known for Valpha and for Vbeta domains shows that their first and second hypervariable regions have one of three or four different main-chain conformations (canonical structures). Six of these canonical structures have the same conformation in complexes with peptide-MHC complexes, the free receptor and/or in an isolated V domain. Thus, for at least the first and second hypervariable regions in the currently known structures, the conformation of the canonical structures is well defined in the free state and is conserved on formation of complexes with peptide-MHC. We identified the key residues that are mainly responsible for the conformation of each canonical structure. The first and second hypervariable regions of Valpha and Vbeta domains are encoded by the germline V segments. Humans have 37 functional Valpha segments and 47 Vbeta segments, and mice have 20 Vbeta segments. Inspection of the size of their hypervariable regions, and of sites that contain key residues, indicates that close to 70 % of Valpha segments and 90 % of Vbeta segments have hypervariable regions with a conformation of one of the known canonical structures. The alpha and beta V gene segments in both humans and mice have only a few combinations of different canonical structure in their first and second hypervariable regions. In human Vbeta domains, the number of different sequences with these canonical structure combinations is larger than in mice, whilst for Valpha domains it is probably smaller.


Subject(s)
Receptors, Antigen, T-Cell, alpha-beta/chemistry , Receptors, Antigen, T-Cell, alpha-beta/genetics , Amino Acid Sequence , Animals , Genes, T-Cell Receptor alpha , Genes, T-Cell Receptor beta , Genetic Variation , Humans , Hydrogen Bonding , Mice , Models, Molecular , Molecular Sequence Data , Protein Conformation , Sequence Alignment , Sequence Homology, Amino Acid , T-Lymphocytes/immunology
13.
J Mol Biol ; 295(3): 641-9, 2000 Jan 21.
Article in English | MEDLINE | ID: mdl-10623553

ABSTRACT

What are the selective pressures on protein sequences during evolution? Amino acid residues may be highly conserved for functional or structural (stability) reasons. Theoretical studies have proposed that residues involved in the folding nucleus may also be highly conserved. To test this we are using an experimental "fold approach" to the study of protein folding. This compares the folding and stability of a number of proteins that share the same fold, but have no common amino acid sequence or biological activity. The fold selected for this study is the immunoglobulin-like beta-sandwich fold, which is a fold that has no specifically conserved function. Four model proteins are used from two distinct superfamilies that share the immunoglobulin-like fold, the fibronectin type III and immunoglobulin superfamilies. Here, the fold approach and protein engineering are used to question the role of a highly conserved tyrosine in the "tyrosine corner" motif that is found ubiquitously and exclusively in Greek key proteins. In the four model beta-sandwich proteins characterised here, the tyrosine is the only residue that is absolutely conserved at equivalent sites. By mutating this position to phenylalanine, we show that the tyrosine hydroxyl is not required to nucleate folding in the immunoglobulin superfamily, whereas it is involved to some extent in early structure formation in the fibronectin type III superfamily. The tyrosine corner is important for stability: mutation to phenylalanine costs between 1.5 and 3 kcal mol(-1). We propose that the high level of conservation of the tyrosine is related to the structural restraints of the loop connecting the beta-sheets, representing an evolutionary "cul-de-sac".


Subject(s)
Evolution, Molecular , Protein Folding , Proteins/chemistry , Tyrosine/chemistry , Amino Acid Sequence , Models, Molecular , Molecular Sequence Data , Mutation , Proteins/genetics , Sequence Homology, Amino Acid
14.
J Comput Biol ; 7(5): 673-84, 2000.
Article in English | MEDLINE | ID: mdl-11153093

ABSTRACT

A previously developed algorithmic method for identifying a geometric invariant of protein structures, termed geometrical core, is extended to the C(L) and C(H1) domains of immunoglobulin molecules. The method uses the matrix of C(alpha) - C(alpha) distances and does not require the usual superposition of structures. The result of applying the algorithm to 53 Immunoglobulin structures led to the identification of two geometrical core sets of C(alpha) atom positions for the C(L) and C(H1) domains.


Subject(s)
Algorithms , Immunoglobulin Constant Regions/chemistry , Amino Acid Sequence , Databases, Factual , Immunoglobulin Constant Regions/genetics , Immunoglobulin Heavy Chains/chemistry , Immunoglobulin Heavy Chains/genetics , Immunoglobulin Light Chains/chemistry , Immunoglobulin Light Chains/genetics , Molecular Sequence Data , Protein Structure, Secondary , Protein Structure, Tertiary , Sequence Alignment/methods , Sequence Alignment/statistics & numerical data , Sequence Homology, Amino Acid
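
The abstract above notes that working from the matrix of C(alpha) - C(alpha) distances avoids superposing structures. A minimal sketch of why that holds: pairwise distances are unchanged by rotating or translating a structure. The coordinates and helper names below are hypothetical, and this shows only the invariance property, not the geometrical core algorithm itself.

```python
import math


def distance_matrix(coords):
    """Return the symmetric matrix of pairwise Euclidean distances."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def rotate_z_and_shift(coords, angle, shift):
    """Rotate points about the z axis by `angle` radians, then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in coords
    ]


def max_abs_difference(m1, m2):
    """Largest element-wise absolute difference between two equal-sized matrices."""
    return max(abs(a - b) for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2))


# Hypothetical C-alpha coordinates (angstroms) for a four-residue fragment.
ca_atoms = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.5, 3.4, 0.0), (9.1, 3.9, 1.2)]
moved = rotate_z_and_shift(ca_atoms, angle=1.1, shift=(10.0, -4.0, 2.5))

# The same structure in a different orientation gives the same distance matrix,
# up to floating-point rounding, so the two can be compared without superposition.
difference = max_abs_difference(distance_matrix(ca_atoms), distance_matrix(moved))
print(difference < 1e-9)
```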
15.
Nucleic Acids Res ; 28(1): 257-9, 2000 Jan 01.
Article in English | MEDLINE | ID: mdl-10592240

ABSTRACT

The Structural Classification of Proteins (SCOP) database provides a detailed and comprehensive description of the relationships of known protein structures. The classification is on hierarchical levels: the first two levels, family and superfamily, describe near and distant evolutionary relationships; the third, fold, describes geometrical relationships. The distinction between evolutionary relationships and those that arise from the physics and chemistry of proteins is a feature that is unique to this database so far. The sequences of proteins in SCOP provide the basis of the ASTRAL sequence libraries that can be used as a source of data to calibrate sequence search algorithms and for the generation of statistics on, or selections of, protein structures. Links can be made from SCOP to PDB-ISL: a library containing sequences homologous to proteins of known structure. Sequences of proteins of unknown structure can be matched to distantly related proteins of known structure by using pairwise sequence comparison methods to find homologues in PDB-ISL. The database and its associated files are freely accessible from a number of WWW sites mirrored from URL http://scop.mrc-lmb.cam.ac.uk/scop/


Subject(s)
Databases, Factual , Protein Conformation , Evolution, Molecular , Information Storage and Retrieval , Internet , Proteins/chemistry , Proteins/genetics
17.
J Mol Biol ; 290(1): 253-66, 1999 Jul 02.
Article in English | MEDLINE | ID: mdl-10388571

ABSTRACT

The sizes of atomic groups are a fundamental aspect of protein structure. They are usually expressed in terms of standard sets of radii for atomic groups and of volumes for both these groups and whole residues. Atomic groups, which subsume a heavy atom and its covalently attached hydrogen atoms into one moiety, are used because the positions of hydrogen atoms in protein structures are generally not known. We have calculated new values for the radii of atomic groups and for the volumes of atomic groups. These values should prove useful in the analysis of protein packing, protein recognition and ligand design. Our radii for atomic groups were derived from intermolecular distance calculations on a large number (approximately 30,000) of crystal structures of small organic compounds that contain the same atomic groups as those found in proteins. Our radii show significant differences from previously reported values. We also use this new radii set to determine the packing efficiency in different regions of the protein interior. This analysis shows that, if the surface water molecules are included in the calculations, the overall packing efficiency throughout the protein interior is high and fairly uniform. However, if the water structure is removed, the packing efficiency in peripheral regions of the protein interior is underestimated, by approximately 3.5 %.


Subject(s)
Proteins/chemistry , Animals , Humans , Hydrogen Bonding
18.
Curr Opin Struct Biol ; 9(3): 390-9, 1999 Jun.
Article in English | MEDLINE | ID: mdl-10361097

ABSTRACT

New computational techniques have allowed protein folds to be assigned to all or parts of between a quarter (Caenorhabditis elegans) and a half (Mycoplasma genitalium) of the individual protein sequences in different genomes. These assignments give a new perspective on domain structures, gene duplications, protein families and protein folds in genome sequences.


Subject(s)
Computational Biology/methods , Computational Biology/trends , Genome , Proteins/chemistry , Proteins/genetics , Animals , Protein Conformation
19.
J Mol Biol ; 285(5): 2177-98, 1999 Feb 05.
Article in English | MEDLINE | ID: mdl-9925793

ABSTRACT

The non-covalent assembly of proteins that fold separately is central to many biological processes, and differs from the permanent macromolecular assembly of protein subunits in oligomeric proteins. We performed an analysis of the atomic structure of the recognition sites seen in 75 protein-protein complexes of known three-dimensional structure: 24 protease-inhibitor, 19 antibody-antigen and 32 other complexes, including nine enzyme-inhibitor and 11 that are involved in signal transduction. The size of the recognition site is related to the conformational changes that occur upon association. Of the 75 complexes, 52 have "standard-size" interfaces in which the total area buried by the components in the recognition site is 1600 (+/-400) Å². In these complexes, association involves only small changes of conformation. Twenty complexes have "large" interfaces burying 2000 to 4660 Å², and large conformational changes are seen to occur in those cases where we can compare the structure of complexed and free components. The average interface has approximately the same non-polar character as the protein surface as a whole, and carries somewhat fewer charged groups. However, some interfaces are significantly more polar and others more non-polar than the average. Of the atoms that lose accessibility upon association, half make contacts across the interface and one-third become fully inaccessible to the solvent. In the latter case, the Voronoi volume was calculated and compared with that of atoms buried inside proteins. The ratio of the two volumes was 1.01 (+/-0.03) in all but 11 complexes, which shows that atoms buried at protein-protein interfaces are close-packed like the protein interior. This conclusion could be extended to the majority of interface atoms by including solvent positions determined in high-resolution X-ray structures in the calculation of Voronoi volumes. Thus, water molecules contribute to the close-packing of atoms that ensure complementarity between the two protein surfaces, as well as providing polar interactions between the two proteins.


Subject(s)
Models, Molecular , Proteins/chemistry , Proteins/metabolism , Amino Acids/chemistry , Antibodies/chemistry , Antibodies/metabolism , Antigens/chemistry , Antigens/metabolism , Binding Sites , Endopeptidases/chemistry , Endopeptidases/metabolism , GTP-Binding Proteins/chemistry , GTP-Binding Proteins/metabolism , Hydrogen Bonding , Protease Inhibitors/chemistry , Protease Inhibitors/metabolism , Protein Conformation , Water
20.
EMBO J ; 18(2): 297-305, 1999 Jan 15.
Article in English | MEDLINE | ID: mdl-9889186

ABSTRACT

Most cases of autosomal dominant polycystic kidney disease (ADPKD) are the result of mutations in the PKD1 gene. The PKD1 gene codes for a large cell-surface glycoprotein, polycystin-1, of unknown function, which, based on its predicted domain structure, may be involved in protein-protein and protein-carbohydrate interactions. Approximately 30% of polycystin-1 consists of 16 copies of a novel protein module called the PKD domain. Here we show that this domain has a beta-sandwich fold. Although this fold is common to a number of cell-surface modules, the PKD domain represents a distinct protein family. The tenth PKD domain of human and Fugu polycystin-1 show extensive conservation of surface residues suggesting that this region could be a ligand-binding site. This structure will allow the likely effects of missense mutations in a large part of the PKD1 gene to be determined.


Subject(s)
Polycystic Kidney, Autosomal Dominant/genetics , Proteins/chemistry , Proteins/genetics , Amino Acid Sequence , Animals , Base Sequence , Conserved Sequence , DNA Primers/genetics , Escherichia coli/genetics , Fishes, Poisonous/genetics , Humans , Magnetic Resonance Spectroscopy , Models, Molecular , Molecular Sequence Data , Mutation , Protein Conformation , Protein Structure, Secondary , Recombinant Proteins/chemistry , Recombinant Proteins/genetics , TRPP Cation Channels