Results 1 - 20 of 25
1.
J Mol Biol ; 314(5): 1041-52, 2001 Dec 14.
Article in English | MEDLINE | ID: mdl-11743721

ABSTRACT

Orthologs are genes in different species that originate from a single gene in the last common ancestor of these species. Such genes have often retained identical biological roles in the present-day organisms. It is hence important to identify orthologs for transferring functional information between genes in different organisms with a high degree of reliability. For example, orthologs of human proteins are often functionally characterized in model organisms. Unfortunately, orthology analysis between human and e.g. invertebrates is often complex because of large numbers of paralogs within protein families. Paralogs that predate the species split, which we call out-paralogs, can easily be confused with true orthologs. Paralogs that arose after the species split, which we call in-paralogs, however, are bona fide orthologs by definition. Orthologs and in-paralogs are typically detected with phylogenetic methods, but these are slow and difficult to automate. Automatic clustering methods based on two-way best genome-wide matches, on the other hand, have so far not separated in-paralogs from out-paralogs effectively. We present a fully automatic method for finding orthologs and in-paralogs from two species. Ortholog clusters are seeded with a two-way best pairwise match, after which an algorithm for adding in-paralogs is applied. The method bypasses multiple alignments and phylogenetic trees, which can be slow and error-prone steps in classical ortholog detection. Still, it robustly detects complex orthologous relationships and assigns confidence values for both orthologs and in-paralogs. The program, called INPARANOID, was tested on all completely sequenced eukaryotic genomes. To assess the quality of INPARANOID results, ortholog clusters were generated from a dataset of worm and mammalian transmembrane proteins, and were compared to clusters derived by manual tree-based ortholog detection methods.
This study identified, with a high degree of confidence, over a dozen novel worm-mammalian ortholog assignments that were previously undetected because of shortcomings of phylogenetic methods. A WWW server that allows searching for orthologs between human and several fully sequenced genomes is installed at http://www.cgb.ki.se/inparanoid/. This is the first comprehensive resource with orthologs of all fully sequenced eukaryotic genomes. Programs and tables of orthology assignments are available from the same location.
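The seeding step described in this abstract (a two-way best pairwise match, followed by the addition of in-paralogs) can be illustrated with a minimal Python sketch. This is not the INPARANOID program: the function names, the score dictionaries, and in particular the in-paralog rule (a same-species gene joins the cluster when it scores at least as high against the seed gene as the seed scores against its ortholog) are assumptions made for illustration.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring target as (target, score)."""
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return best


def seed_ortholog_pairs(scores_ab):
    """Return (a, b, score) triples that are two-way best matches.

    scores_ab maps (gene_in_A, gene_in_B) to a symmetric similarity score.
    """
    forward = best_hits(scores_ab)
    reverse = best_hits({(b, a): s for (a, b), s in scores_ab.items()})
    return sorted((a, b, s) for a, (b, s) in forward.items() if reverse[b][0] == a)


def add_inparalogs(seed_gene, seed_score, within_scores):
    """Assumed rule (hypothetical): keep same-species genes whose score
    against the seed is at least the seed's between-species score."""
    return sorted(
        other
        for (gene, other), score in within_scores.items()
        if gene == seed_gene and other != seed_gene and score >= seed_score
    )


# Toy scores between species A (h1, h2) and species B (w1, w2).
between = {("h1", "w1"): 100, ("h1", "w2"): 60, ("h2", "w1"): 80, ("h2", "w2"): 50}
within_a = {("h1", "h2"): 120, ("h1", "h3"): 70}

seeds = seed_ortholog_pairs(between)
inparalogs = add_inparalogs("h1", 100, within_a)
```

In the toy data, h1 and w1 are each other's best match and seed a cluster; h2 is closer to h1 than h1 is to w1, so under the assumed rule it is added as an in-paralog, while h3 is not.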


Subject(s)
Caenorhabditis elegans/genetics , Computational Biology/methods , Drosophila melanogaster/genetics , Evolution, Molecular , Genome , Genomics/methods , Sequence Homology , Algorithms , Animals , Automation/methods , Caenorhabditis elegans Proteins/genetics , Cluster Analysis , Databases, Genetic , Drosophila Proteins/genetics , Eukaryotic Cells/metabolism , Humans , Phylogeny , Software , Species Specificity
2.
Proteins ; 45(3): 262-73, 2001 Nov 15.
Article in English | MEDLINE | ID: mdl-11599029

ABSTRACT

Several protein sequence analysis algorithms are based on properties of amino acid composition and repetitiveness. These include methods for prediction of secondary structure elements, coiled-coils, transmembrane segments or signal peptides, and for assignment of low-complexity, nonglobular, or intrinsically unstructured regions. The quality of such analyses can be greatly enhanced by graphical software tools that present predicted sequence features together in context and allow judgment to be focused simultaneously on several different types of supporting information. For these purposes, we describe the SFINX package, which allows many different sets of segmental or continuous-curve sequence feature data, generated by individual external programs, to be viewed in combination alongside a sequence dot-plot or a multiple alignment of database matches. The implementation is currently based on extensions to the graphical viewers Dotter and Blixem and scripts that convert data from external programs to a simple generic data definition format called SFS. We describe applications in which dot-plots and flanking database matches provide valuable contextual information for analyses based on compositional and repetitive sequence features. The system is also useful for comparing results from algorithms run with a range of parameters to determine appropriate values for defaults or cutoffs for large-scale genomic analyses.


Subject(s)
Amino Acids/chemistry , Proteins/chemistry , Sequence Analysis, Protein/methods , Amino Acid Motifs , Amino Acid Sequence , Data Display , Databases, Protein , Internet , Membrane Proteins/chemistry , Protein Structure, Tertiary , Repetitive Sequences, Amino Acid , Software
3.
Bioinformatics ; 17(7): 656-7, 2001 Jul.
Article in English | MEDLINE | ID: mdl-11448885

ABSTRACT

MEDUSA is a tool for automatic selection and visual assessment of PCR primer pairs, developed to assist large scale gene expression analysis projects. The system allows specification of constraints on the location of and distances between the primers in a pair. For instance, primers in coding, non-coding, or exon/intron-spanning regions might be selected. MEDUSA applies these constraints as a filter to primers predicted by three external programs, and displays the resulting primer pairs graphically in the Blixem (Sonnhammer and Durbin, Comput. Appl. Biosci. 10, 301-307, 1994; http://www.cgr.ki.se/cgr/groups/sonnhammer/Blixem.html) viewer. AVAILABILITY: The MEDUSA web server is available at http://www.cgr.ki.se/cgr/MEDUSA. The source code and user information are available at ftp://ftp.cgr.ki.se/pub/prog/medusa.


Subject(s)
DNA Primers , Polymerase Chain Reaction/statistics & numerical data , Software , Base Sequence , Computational Biology , DNA Primers/genetics , Genomics , Molecular Sequence Data
4.
Bioinformatics ; 17(4): 343-8, 2001 Apr.
Article in English | MEDLINE | ID: mdl-11301303

ABSTRACT

MOTIVATION: Multi-domain proteins have evolved by insertions or deletions of distinct protein domains. Tracing the history of a certain domain combination can be important for functional annotation of multi-domain proteins, and for understanding the function of individual domains. In order to analyze the evolutionary history of the domains in modular proteins it is desirable to inspect a phylogenetic tree based on sequence divergence with the modular architecture of the sequences superimposed on the tree. RESULT: A Java applet, NIFAS, that integrates graphical domain schematics for each sequence in an evolutionary tree was developed. NIFAS retrieves domain information from the Pfam database and uses CLUSTAL W to calculate a tree for a given Pfam domain. The tree can be displayed with symbolic bootstrap values, and to allow the user to focus on a part of the tree, the layout can be altered by swapping nodes, changing the outgroup, and showing/collapsing subtrees. NIFAS is integrated with the Pfam database and is accessible over the internet (http://www.cgr.ki.se/Pfam). As an example, we use NIFAS to analyze the evolution of domains in Protein Kinases C.


Subject(s)
Evolution, Molecular , Image Processing, Computer-Assisted , Protein Kinase C/chemistry , Protein Structure, Tertiary , Proteins/chemistry , Software , Computer Graphics , Humans , Protein Kinase C/classification , Proteins/classification
5.
J Mol Biol ; 305(3): 567-80, 2001 Jan 19.
Article in English | MEDLINE | ID: mdl-11152613

ABSTRACT

We describe and validate a new membrane protein topology prediction method, TMHMM, based on a hidden Markov model. We present a detailed analysis of TMHMM's performance, and show that it correctly predicts 97-98% of the transmembrane helices. Additionally, TMHMM can discriminate between soluble and membrane proteins with both specificity and sensitivity better than 99%, although the accuracy drops when signal peptides are present. This high degree of accuracy allowed us to reliably predict integral membrane proteins in a large collection of genomes. Based on these predictions, we estimate that 20-30% of all genes in most genomes encode membrane proteins, which is in agreement with previous estimates. We further discovered that proteins with N(in)-C(in) topologies are strongly preferred in all examined organisms, except Caenorhabditis elegans, where the large number of 7TM receptors increases the counts for N(out)-C(in) topologies. We discuss the possible relevance of this finding for our understanding of membrane protein assembly mechanisms. A TMHMM prediction service is available at http://www.cbs.dtu.dk/services/TMHMM/.


Subject(s)
Computational Biology/methods , Genome , Markov Chains , Membrane Proteins/chemistry , Animals , Bacterial Proteins/chemistry , Databases as Topic , Fungal Proteins/chemistry , Internet , Plant Proteins/chemistry , Porins/chemistry , Protein Sorting Signals , Protein Structure, Secondary , Reproducibility of Results , Research Design , Sensitivity and Specificity , Software , Solubility
6.
Curr Protoc Cell Biol ; Appendix 1: Appendix 1C, 2001 May.
Article in English | MEDLINE | ID: mdl-18228275

ABSTRACT

This brief appendix serves as a guide for the analysis of functional motifs in proteins. Several database search engines that can be accessed via the World Wide Web are described. Such computerized searches have become the preferred method to scan large sequence and motif databases, as the searches are efficient and the databases are updated frequently. A short list of sorting signals is also included, since these motifs often cannot be predicted reliably by a computer search.


Subject(s)
Amino Acid Motifs/genetics , Databases, Protein , Molecular Biology/methods , Proteins/chemistry , Proteins/genetics , Proteomics/methods , Animals , Computational Biology/methods , Humans , Information Storage and Retrieval/methods , Internet/organization & administration , Sequence Analysis, Protein/methods
8.
Nucleic Acids Res ; 28(1): 263-6, 2000 Jan 01.
Article in English | MEDLINE | ID: mdl-10592242

ABSTRACT

Pfam is a large collection of protein multiple sequence alignments and profile hidden Markov models. Pfam is available on the WWW in the UK at http://www.sanger.ac.uk/Software/Pfam/, in Sweden at http://www.cgr.ki.se/Pfam/ and in the US at http://pfam.wustl.edu/. The latest version (4.3) of Pfam contains 1815 families. These Pfam families match 63% of proteins in SWISS-PROT 37 and TrEMBL 9. For complete genomes Pfam currently matches up to half of the proteins. Genomic DNA can be directly searched against the Pfam library using the Wise2 package.


Subject(s)
Databases, Factual , Proteins/chemistry , Genome , Information Storage and Retrieval , Internet , Quality Control
9.
Bioinformatics ; 15(6): 480-500, 1999 Jun.
Article in English | MEDLINE | ID: mdl-10383473

ABSTRACT

MOTIVATION: Protein families can be defined based on structure or sequence similarity. We wanted to compare two protein family databases, one based on structural and one on sequence similarity, to investigate to what extent they overlap, the similarity in definition of corresponding families, and to create a list of large protein families with unknown structure as a resource for structural genomics. We also wanted to increase the sensitivity of fold assignment by exploiting protein family HMMs. RESULTS: We compared Pfam, a protein family database based on sequence similarity, to Scop, which is based on structural similarity. We found that 70% of the Scop families exist in Pfam while 57% of the Pfam families exist in Scop. Most families that occur in both databases correspond well to each other, but in some cases they are different. Such cases highlight situations in which structure and sequence approaches differ significantly. The comparison enabled us to compile a list of the largest families that do not occur in Scop; these are suitable targets for structure prediction and determination, and may be useful to guide projects in structural genomics. It can be noted that 13 out of the 20 largest protein families without a known structure are likely transmembrane proteins. We also exploited Pfam to increase the sensitivity of detecting homologs of proteins with known structure, by comparing query sequences to Pfam HMMs that correspond to Scop families. For SWISSPROT+TREMBL, this yielded an increase in fold assignment from 31% to 42% compared to using FASTA only. This method assigned a structure to 22% of the proteins in Saccharomyces cerevisiae, 24% in Escherichia coli, and 16% in Methanococcus jannaschii.


Subject(s)
Databases, Factual , Proteins/chemistry , Proteins/genetics , Computational Biology , Genome , Protein Folding , Proteins/classification , Sequence Alignment , Sequence Homology, Amino Acid
10.
Nucleic Acids Res ; 27(1): 260-2, 1999 Jan 01.
Article in English | MEDLINE | ID: mdl-9847196

ABSTRACT

Pfam is a collection of multiple alignments and profile hidden Markov models of protein domain families. Release 3.1 is a major update of the Pfam database and contains 1313 families which are available on the World Wide Web in Europe at http://www.sanger.ac.uk/Software/Pfam/ and http://www.cgr.ki.se/Pfam/, and in the US at http://pfam.wustl.edu/. Over 54% of proteins in SWISS-PROT-35 and SP-TrEMBL-5 match a Pfam family. The primary changes of Pfam since release 2.1 are that we now use the more advanced version 2 of the HMMER software, which is more sensitive and provides expectation values for matches, and that it now includes proteins from both SP-TrEMBL and SWISS-PROT.


Subject(s)
Databases, Factual , Proteins/chemistry , Sequence Alignment , Software , Algorithms , Amino Acid Sequence , Databases, Factual/standards , Information Storage and Retrieval , Internet , Protein Conformation , Proteins/genetics , Quality Control , Sequence Homology, Amino Acid , Statistics as Topic
11.
Article in English | MEDLINE | ID: mdl-9783223

ABSTRACT

A novel method to model and predict the location and orientation of alpha helices in membrane-spanning proteins is presented. It is based on a hidden Markov model (HMM) with an architecture that corresponds closely to the biological system. The model is cyclic with 7 types of states for helix core, helix caps on either side, loop on the cytoplasmic side, two loops for the non-cytoplasmic side, and a globular domain state in the middle of each loop. The two loop paths on the non-cytoplasmic side are used to model short and long loops separately, which corresponds biologically to the two known different membrane insertion mechanisms. The close mapping between the biological and computational states allows us to infer which parts of the model architecture are important to capture the information that encodes the membrane topology, and to gain a better understanding of the mechanisms and constraints involved. Models were estimated both by maximum likelihood and a discriminative method, and a method for reassignment of the membrane helix boundaries was developed. In a cross-validated test on single sequences, our transmembrane HMM, TMHMM, correctly predicts the entire topology for 77% of the sequences in a standard dataset of 83 proteins with known topology. The same accuracy was achieved on a larger dataset of 160 proteins. These results compare favourably with existing methods.
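To make the idea of labelling residues with HMM states concrete, the following is a minimal Python sketch of Viterbi decoding on a toy two-state model. It is not the TMHMM architecture (which the abstract describes as cyclic with 7 types of states), and the abstract does not say which decoding algorithm TMHMM uses; the states "M" (membrane) and "L" (loop), the symbols "h" (hydrophobic) and "p" (polar), and all probabilities are invented for illustration.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for obs, computed in log space."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # Initialise with the first observation.
    score = {s: log(start_p[s]) + log(emit_p[s][obs[0]]) for s in states}
    back = []  # back[t][s] = best predecessor of state s at step t + 1

    for symbol in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + log(trans_p[r][s]))
            score[s] = prev[best] + log(trans_p[best][s]) + log(emit_p[s][symbol])
            pointers[s] = best
        back.append(pointers)

    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy model: all numbers are illustrative assumptions.
states = ["M", "L"]
start_p = {"M": 0.5, "L": 0.5}
trans_p = {"M": {"M": 0.9, "L": 0.1}, "L": {"M": 0.1, "L": 0.9}}
emit_p = {"M": {"h": 0.8, "p": 0.2}, "L": {"h": 0.2, "p": 0.8}}

labels = "".join(viterbi("hhhppp", states, start_p, trans_p, emit_p))
```

With these toy parameters a run of hydrophobic symbols followed by a run of polar symbols is labelled as a membrane segment followed by a loop.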


Subject(s)
Markov Chains , Membrane Proteins/chemistry , Membrane Proteins/genetics , Models, Molecular , Protein Structure, Secondary , Amino Acid Sequence , Artificial Intelligence , Databases, Factual , Molecular Sequence Data
12.
J Mol Graph Model ; 16(1): 1-5, 33, 1998 Feb.
Article in English | MEDLINE | ID: mdl-9783253

ABSTRACT

The two-dimensional contact map of interresidue distances is a visual analysis technique for protein structures. We present two standalone software tools designed to be used in combination to increase the versatility of this simple yet powerful technique. First, the program Structer calculates contact maps from three-dimensional molecular structural data. The contact map matrix can then be viewed in the graphical matrix-visualization program Dotter. Instead of using a predefined distance cutoff, we exploit Dotter's dynamic rendering control, allowing interactive exploration at varying distance cutoffs after calculating the matrix once. Structer can use a number of distance measures, can incorporate multiple chains in one contact map, and allows masking of user-defined residue sets. It works either directly with PDB files, or can use the MMDB network API for reading structures.
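The central idea here, calculating the distance matrix once and then exploring different cutoffs without recomputation, can be sketched in a few lines of Python. This is not Structer or Dotter: the function names and the toy coordinates are assumptions, and a real tool would read coordinates from a structure file and choose among several distance measures.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between residue coordinates (computed once)."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def contacts_at(matrix, cutoff):
    """Residue pairs (i < j) whose distance is within the given cutoff."""
    n = len(matrix)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if matrix[i][j] <= cutoff]


# Three hypothetical residue positions (e.g. one representative atom each).
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
matrix = distance_matrix(coords)

# The same matrix is reused for every cutoff the user tries.
tight = contacts_at(matrix, 3.5)
loose = contacts_at(matrix, 5.0)
```

Separating the expensive step (the matrix) from the cheap step (thresholding) is what makes interactive exploration of cutoffs practical.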


Subject(s)
Computer Simulation , Models, Molecular , Protein Conformation , Software , Diphtheria Toxin/chemistry , Histocompatibility Antigens Class II/chemistry , Humans
13.
Nucleic Acids Res ; 26(1): 320-2, 1998 Jan 01.
Article in English | MEDLINE | ID: mdl-9399864

ABSTRACT

Pfam contains multiple alignments and hidden Markov model based profiles (HMM-profiles) of complete protein domains. The definition of domain boundaries, family members and alignment is done semi-automatically based on expert knowledge, sequence similarity, other protein family databases and the ability of HMM-profiles to correctly identify and align the members. Release 2.0 of Pfam contains 527 manually verified families which are available for browsing and on-line searching via the World Wide Web in the UK at http://www.sanger.ac.uk/Pfam/ and in the US at http://genome.wustl.edu/Pfam/. Pfam 2.0 matches one or more domains in 50% of Swissprot-34 sequences, and 25% of a large sample of predicted proteins from the Caenorhabditis elegans genome.


Subject(s)
Databases, Factual , Proteins/chemistry , Sequence Alignment , Amino Acid Sequence , Animals , Caenorhabditis elegans , Computer Communication Networks , Information Storage and Retrieval , Markov Chains , Models, Molecular , Molecular Sequence Data
15.
Proteins ; 28(3): 405-20, 1997 Jul.
Article in English | MEDLINE | ID: mdl-9223186

ABSTRACT

Databases of multiple sequence alignments are a valuable aid to protein sequence classification and analysis. One of the main challenges when constructing such a database is to simultaneously satisfy the conflicting demands of completeness on the one hand and quality of alignment and domain definitions on the other. The latter properties are best dealt with by manual approaches, whereas completeness in practice is only amenable to automatic methods. Herein we present a database based on hidden Markov model profiles (HMMs), which combines high quality and completeness. Our database, Pfam, consists of parts A and B. Pfam-A is curated and contains well-characterized protein domain families with high quality alignments, which are maintained by using manually checked seed alignments and HMMs to find and align all members. Pfam-B contains sequence families that were generated automatically by applying the Domainer algorithm to cluster and align the remaining protein sequences after removal of Pfam-A domains. By using Pfam, a large number of previously unannotated proteins from the Caenorhabditis elegans genome project were classified. We have also identified many novel family memberships in known proteins, including new kazal, Fibronectin type III, and response regulator receiver domains. Pfam-A families have permanent accession numbers and form a library of HMMs available for searching and automatic annotation of new protein sequences.


Subject(s)
Amino Acid Sequence , Databases, Factual , Plant Proteins/chemistry , Protein Structure, Tertiary , Sequence Alignment , Models, Chemical , Molecular Sequence Data , Multigene Family , Seeds/chemistry , Sequence Homology, Amino Acid
16.
J Mol Biol ; 270(4): 587-97, 1997 Jul 25.
Article in English | MEDLINE | ID: mdl-9245589

ABSTRACT

We have determined the complete nucleotide sequence of the human immunoglobulin D segment locus on chromosome 14q32.3 and identified a total of 27 D segments, of which nine are new. Comparison with a database of rearranged heavy chain sequences indicates that the human antibody repertoire is created by VDJ recombination involving 25 of these 27 D segments, extensive processing at the V-D and D-J junctions and use of multiple reading frames. We could find no evidence for the proposed use of DIR segments, inverted D segments, "minor" D segments or D-D recombination. Conventional VDJ recombination, which obeys the 12/23 rule, is therefore sufficient to explain the wealth of lengths and sequences for the third hypervariable loop of human heavy chains.


Subject(s)
Chromosomes, Human, Pair 14 , Immunoglobulin D/genetics , Recombination, Genetic , Base Sequence , Chromosome Mapping , Evolution, Molecular , Germ Cells , Humans , Immunoglobulin Joining Region/genetics , Immunoglobulin Variable Region/genetics , Molecular Sequence Data , Open Reading Frames
17.
Genomics ; 46(2): 200-16, 1997 Dec 01.
Article in English | MEDLINE | ID: mdl-9417907

ABSTRACT

The Caenorhabditis elegans genome sequencing project has completed over half of this nematode's 100-Mb genome. Proteins predicted in the finished sequence have been compiled and released in the database Wormpep. Presented here is a comprehensive analysis of protein domain families in Wormpep 11, which comprises 7299 proteins. The relative abundance of common protein domain families was counted by comparing all Wormpep proteins to the Pfam collection of protein families, which is based on recognition by hidden Markov models. This analysis also identified a number of previously unannotated domains. To investigate new apparently nematode-specific protein families, Wormpep was clustered into domain families on the basis of sequence similarity using the Domainer program. The largest clusters that lacked clear homology to proteins outside Nematoda were analyzed in further detail, after which some could be assigned a putative function. We compared all proteins in Wormpep 11 to proteins in the human, Saccharomyces cerevisiae, and Haemophilus influenzae genomes. Among the results are the estimation that over two-thirds of the currently known human proteins are likely to have a homologue in the whole C. elegans genome and that a significant number of proteins are well conserved between C. elegans and H. influenzae, that are not found in S. cerevisiae.


Subject(s)
Caenorhabditis elegans/genetics , Helminth Proteins/genetics , Helminth Proteins/metabolism , Proteins/genetics , Sequence Homology, Amino Acid , Amino Acid Sequence , Animals , Bacterial Proteins/genetics , Databases, Factual , Haemophilus influenzae/genetics , Humans , Molecular Sequence Data , Saccharomyces cerevisiae/genetics
18.
J Mol Biol ; 256(5): 813-17, 1996 Mar 15.
Article in English | MEDLINE | ID: mdl-8601832

ABSTRACT

In the human immune system, antibodies with high affinities for antigen are created in two stages. A diverse primary repertoire of antibody structures is produced by the combinatorial rearrangement of germline V gene segments and antibodies are selected from this repertoire by binding to the antigen. Their affinities are then improved by somatic hypermutation and further rounds of selection. We have dissected the sequence diversity created at each stage in response to a wide range of antigens. In the primary repertoire, diversity is focused at the centre of the binding site. With somatic hypermutation, diversity spreads to regions at the periphery of the binding site that are highly conserved in the primary repertoire. We propose that evolution has favoured this complementarity as an efficient strategy for searching sequence space and that the germline V gene families evolved to exploit the diversity created by somatic hypermutation.


Subject(s)
Antibody Diversity , Genes, Immunoglobulin , Immunoglobulin Variable Region/genetics , Mutation , Binding Sites, Antibody/genetics , Biological Evolution , Humans , Immunoglobulin Variable Region/chemistry , Immunoglobulin Variable Region/ultrastructure , Models, Genetic , Models, Molecular
19.
Gene ; 167(1-2): GC1-10, 1995 Dec 29.
Article in English | MEDLINE | ID: mdl-8566757

ABSTRACT

Graphical dot-matrix plots can provide the most complete and detailed comparison of two sequences. Presented here is DOTTER, a dot-plot program for X-windows which can compare DNA or protein sequences, and also DNA versus protein. The main novel feature of DOTTER is that the user can vary the stringency cutoffs interactively, so that the dot-matrix only needs to be calculated once. This is possible thanks to a 'Greyramp tool' that was developed to change the displayed stringency of the matrix by dynamically changing the greyscale rendering of the dots. The Greyramp tool allows the user to interactively change the lower and upper score limit for the greyscale rendering. This allows exploration of the separation between signal and noise, and fine-grained visualisation of different score levels in the dot-matrix. Other useful features are dot-matrix compression, mouse-controlled zooming, sequence alignment display and saving/loading of dot-matrices. Since the matrix only has to be calculated once and since the algorithm is fast and linear in space, DOTTER is practical to use even for sequences as long as cosmids. DOTTER was integrated in the gene-modelling module of the genomic database system ACEDB. This was done via the homology viewer BLIXEM in a way that also allows segments from the BLAST suite of searching programs to be superimposed on top of the full dot-matrix. This feature can also be used for very quick finding of the strongest matches. As examples, we analyse a Caenorhabditis elegans cosmid with several tandem repeat families, and illustrate how DOTTER can improve gene modelling.
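The separation between calculating the dot-matrix once and re-rendering it under different lower and upper score limits can be sketched in Python as follows. This is not the DOTTER implementation: the windowed identity score, the function names, and the convention that higher scores map to higher grey levels (0 to 255) are assumptions chosen for illustration.

```python
def dot_matrix(seq_a, seq_b, window=3):
    """Score each window pair by its number of identical positions (computed once)."""
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    return [
        [sum(seq_a[i + k] == seq_b[j + k] for k in range(window)) for j in range(cols)]
        for i in range(rows)
    ]


def grey_ramp(matrix, lower, upper):
    """Map scores to grey levels: at or below lower gives 0, at or above upper
    gives 255, and values in between are scaled linearly."""
    def level(score):
        if score <= lower:
            return 0
        if score >= upper:
            return 255
        return round(255 * (score - lower) / (upper - lower))

    return [[level(score) for score in row] for row in matrix]


# The score matrix is calculated a single time...
scores = dot_matrix("ACGT", "ACGA", window=2)

# ...and only the cheap rendering step is repeated as the limits change.
strict = grey_ramp(scores, 1, 2)
```

Raising the lower limit hides weak matches (noise), while the upper limit controls where strong matches saturate, which mirrors the interactive stringency control described in the abstract.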


Subject(s)
Sequence Analysis/methods , Sequence Homology, Amino Acid , Sequence Homology, Nucleic Acid , Software , Amino Acid Sequence , Data Display , Molecular Sequence Data