Results 1 - 7 of 7
1.
Bioinformatics ; 17(1): 23-43, 2001 Jan.
Article in English | MEDLINE | ID: mdl-11222260

ABSTRACT

MOTIVATION: We present a method for modeling protein families by means of probabilistic suffix trees (PSTs). The method is based on identifying significant patterns in a set of related protein sequences. The patterns can be of arbitrary length, and the input sequences do not need to be aligned, nor is delineation of domain boundaries required. The method is automatic, and can be applied, without assuming any preliminary biological information, with surprising success. Basic biological considerations, such as amino acid background probabilities and amino acid substitution probabilities, can be incorporated to improve performance. RESULTS: The PST can serve as a predictive tool for protein sequence classification, and for detecting conserved patterns (possibly functionally or structurally important) within protein sequences. The method was tested on the Pfam database of protein families with more than satisfactory performance. Exhaustive evaluations show that the PST model detects many more related sequences than pairwise methods such as Gapped-BLAST, and is almost as sensitive as a hidden Markov model that is trained from a multiple alignment of the input sequences, while being much faster.
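The core idea of a PST, predicting the next residue from the longest previously observed suffix of the preceding residues, can be illustrated with a small variable-order Markov sketch. This is not the authors' algorithm: the abstract does not give the tree-building or pruning criteria, so the function names, the `max_depth` and `min_count` cutoffs, and the flat pseudocount smoothing below are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # standard 20-letter alphabet


def train_contexts(sequences, max_depth=3, min_count=2):
    """Count next-residue frequencies for every context (suffix) up to max_depth.

    Assumption: a context is kept if it was observed at least min_count times.
    Real PST construction uses more elaborate significance tests.
    """
    counts = defaultdict(lambda: defaultdict(int))
    counts[""]  # touch the empty context so it always exists
    for seq in sequences:
        for i, symbol in enumerate(seq):
            for depth in range(min(max_depth, i) + 1):
                counts[seq[i - depth:i]][symbol] += 1
    return {
        ctx: dict(nxt)
        for ctx, nxt in counts.items()
        if ctx == "" or sum(nxt.values()) >= min_count
    }


def next_symbol_prob(model, history, symbol, max_depth=3, pseudocount=0.5):
    """Probability of `symbol` given the longest suffix of `history` in the model."""
    depth = min(max_depth, len(history))
    while history[len(history) - depth:] not in model:
        depth -= 1  # back off to a shorter suffix; "" is always present
    nxt = model[history[len(history) - depth:]]
    total = sum(nxt.values()) + pseudocount * len(AMINO_ACIDS)
    return (nxt.get(symbol, 0) + pseudocount) / total


def score(model, seq, max_depth=3):
    """Average log-probability per residue; higher means a better fit to the family."""
    total = sum(
        math.log(next_symbol_prob(model, seq[:i], symbol, max_depth))
        for i, symbol in enumerate(seq)
    )
    return total / len(seq)


# Toy, made-up "family" of unaligned sequences.
family = ["ACDEACDEACDE", "ACDEACDFACDE"]
model = train_contexts(family)
print(score(model, "ACDEACDE"), score(model, "WWYYWWYY"))
```

A sequence that shares the family's recurring patterns receives a higher per-residue score than an unrelated one, which is the sense in which such a model can act as a classifier without any alignment step.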


Subjects
Models, Statistical; Proteins/chemistry; Sequence Analysis, Protein/methods; Algorithms; Computational Biology; Databases, Factual; Markov Chains; Sequence Alignment/methods; Sequence Alignment/statistics & numerical data; Sequence Analysis, Protein/statistics & numerical data; Software
2.
Prog Biophys Mol Biol ; 73(5): 297-320, 2000.
Article in English | MEDLINE | ID: mdl-11063777

ABSTRACT

As the number of complete genomes that have been sequenced keeps growing, unknown areas of the protein space are revealed and new horizons open up. Most of this information will be fully appreciated only when the structural information about the encoded proteins becomes available. The goal of structural genomics is to direct large-scale efforts of protein structure determination, so as to increase the impact of these efforts. This review focuses on current approaches in structural genomics aimed at selecting representative proteins as targets for structure determination. We will discuss the concept of representative structures/folds, the current methodologies for identifying those proteins, and computational techniques for identifying proteins which are expected to adopt new structural folds.


Subjects
Genomics/methods; Protein Conformation; Databases, Factual; Protein Folding; Proteins/chemistry; Proteins/classification; Sequence Analysis, Protein
3.
Article in English | MEDLINE | ID: mdl-10977100

ABSTRACT

In search of global principles that may explain the organization of the space of all possible proteins, we study all known protein sequences and structures. In this paper we present a global map of the protein space based on our analysis. Our protein space contains all protein sequences in a non-redundant (NR) database, which includes all major sequence databases. Using the PSI-BLAST procedure we defined 4,670 clusters of related sequences in this space. Of these clusters, 1,421 are centered on a sequence of known structure. All 4,670 clusters were then compared using either a structure metric (when 3D structures are known) or a novel sequence profile metric. These scores were used to define a unified and consistent metric between all clusters. Two schemes were employed to organize these clusters in a meta-organization. The first uses a graph theory method and clusters the clusters into a hierarchical organization. This organization extends our ability to predict the structure and function of many proteins beyond what is possible with existing tools for sequence analysis. The second uses a variation on a multidimensional scaling technique to embed the clusters in a low-dimensional real space. This last approach resulted in a projection of the protein space onto a 2D plane that provides us with a bird's-eye view of the protein space. Based on this map we suggest a list of possible target sequences with unknown structure that are likely to adopt new, unknown folds.


Subjects
Algorithms; Protein Folding; Proteins/classification; Sequence Analysis, Protein/methods; Animals; Databases, Factual; Humans; Infant, Newborn; Proteins/genetics
4.
Nucleic Acids Res ; 28(1): 49-55, 2000 Jan 01.
Article in English | MEDLINE | ID: mdl-10592179

ABSTRACT

The ProtoMap site offers an exhaustive classification of all proteins in the SWISS-PROT database into groups of related proteins. The classification is based on analysis of all pairwise similarities among protein sequences. The analysis makes essential use of transitivity to identify homologies among proteins. Within each group of the classification, every two members are either directly or transitively related. However, transitivity is applied restrictively in order to prevent unrelated proteins from clustering together. The classification is done at different levels of confidence, and yields a hierarchical organization of all proteins. The resulting classification splits the protein space into well-defined groups of proteins, which are closely correlated with natural biological families and superfamilies. Many clusters contain protein sequences that are not classified by other databases. The hierarchical organization suggested by our analysis may help in detecting finer subfamilies in families of known proteins. In addition, it brings forth interesting relationships between protein families, upon which local maps for the neighborhood of protein families can be sketched. The ProtoMap web server can be accessed at http://www.protomap.cs.huji.ac.il


Subjects
Databases, Factual; Proteins/genetics; Computer Graphics; Information Storage and Retrieval; Internet; Proteins/chemistry; User-Computer Interface
5.
Proteins ; 37(3): 360-78, 1999 Nov 15.
Article in English | MEDLINE | ID: mdl-10591097

ABSTRACT

We investigate the space of all protein sequences in search of clusters of related proteins. Our aim is to automatically detect these sets, and thus obtain a classification of all protein sequences. Our analysis, which uses standard measures of sequence similarity as applied to an all-vs.-all comparison of SWISSPROT, gives a very conservative initial classification based on the highest scoring pairs. The many classes in this classification correspond to protein subfamilies. Subsequently we merge the subclasses using the weaker pairs in a two-phase clustering algorithm. The algorithm makes use of transitivity to identify homologous proteins; however, transitivity is applied restrictively in an attempt to prevent unrelated proteins from clustering together. This process is repeated at varying levels of statistical significance. Consequently, a hierarchical organization of all proteins is obtained. The resulting classification splits the protein space into well-defined groups of proteins, which are closely correlated with natural biological families and superfamilies. Different indices of validity were applied to assess the quality of our classification and compare it with the protein families in the PROSITE and Pfam databases. Our classification agrees with these domain-based classifications for between 64.8% and 88.5% of the proteins. It also finds many new clusters of protein sequences which were not classified by these databases. The hierarchical organization suggested by our analysis reveals finer subfamilies in families of known proteins as well as many novel relations between protein families.


Subjects
Proteins/classification; Sequence Homology, Amino Acid; Algorithms; Databases, Factual; Evolution, Molecular; Proteins/chemistry
6.
Article in English | MEDLINE | ID: mdl-9783227

ABSTRACT

We investigate the space of all protein sequences. We combine the standard measures of similarity (SW, FASTA, BLAST) to associate with each sequence an exhaustive list of neighboring sequences. These lists induce a (weighted directed) graph whose vertices are the sequences. The weight of an edge connecting two sequences represents their degree of similarity. This graph encodes many of the fundamental properties of the sequence space. We look for clusters of related proteins in this graph. These clusters correspond to strongly connected sets of vertices. Two main ideas underlie our work: i) Interesting homologies among proteins can be deduced by transitivity. ii) Transitivity should be applied restrictively in order to prevent unrelated proteins from clustering together. Our analysis starts from a very conservative classification, based on very significant similarities, that has many classes. Subsequently, classes are merged to include less significant similarities. Merging is performed via a novel two-phase algorithm. First, the algorithm identifies groups of possibly related clusters (based on transitivity and strong connectivity) using local considerations, and merges them. Then, a global test is applied to identify nuclei of strong relationships within these groups of clusters, and the classification is refined accordingly. This process takes place at varying thresholds of statistical significance, where at each step the algorithm is applied to the classes of the previous classification to obtain the next one, at the more permissive threshold. Consequently, a hierarchical organization of all proteins is obtained. The resulting classification splits the space of all protein sequences into well-defined groups of proteins. The results show that the automatically induced sets of proteins are closely correlated with natural biological families and superfamilies. The hierarchical organization reveals finer subfamilies that make up known families of proteins as well as many interesting relations between protein families. The hierarchical organization proposed may be considered as the first map of the space of all protein sequences. An interactive web site including the results of our analysis has been constructed, and is now accessible through http://www.protomap.cs.huji.ac.il
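The general pattern described above, starting from a conservative classification and merging classes at progressively more permissive significance thresholds while restricting transitivity, can be sketched as follows. This is a simplified stand-in rather than the two-phase algorithm of the paper, whose local and global tests are not specified in the abstract. The `min_link_fraction` rule (merge two clusters only when enough of their cross pairs are similar), the E-value table, and all names are hypothetical.

```python
from itertools import combinations


def merge_level(clusters, evalues, threshold, min_link_fraction=0.5):
    """Merge clusters whose members are sufficiently linked at `threshold`.

    Assumption: two clusters merge only if at least `min_link_fraction` of
    their cross pairs have an E-value at or below `threshold`. This is a
    hypothetical stand-in for the restrictive use of transitivity.
    """
    clusters = [set(c) for c in clusters]
    merged = True
    while merged:
        merged = False
        for a, b in combinations(range(len(clusters)), 2):
            links = sum(
                1
                for x in clusters[a]
                for y in clusters[b]
                if evalues.get(frozenset((x, y)), float("inf")) <= threshold
            )
            pairs = len(clusters[a]) * len(clusters[b])
            if links > 0 and links / pairs >= min_link_fraction:
                clusters[a] |= clusters[b]
                del clusters[b]
                merged = True
                break  # cluster list changed, so restart the scan
    return clusters


def hierarchy(proteins, evalues, thresholds, min_link_fraction=0.5):
    """Apply merge_level from the strictest to the most permissive threshold."""
    clusters = [{p} for p in proteins]
    levels = []
    for t in sorted(thresholds):  # smallest E-value (most significant) first
        clusters = merge_level(clusters, evalues, t, min_link_fraction)
        levels.append((t, [sorted(c) for c in clusters]))
    return levels


# Made-up pairwise E-values for five hypothetical proteins.
evalues = {
    frozenset("AB"): 1e-100,
    frozenset("BC"): 1e-50,
    frozenset("AC"): 1e-40,
    frozenset("CD"): 1e-10,
    frozenset("DX"): 1e-1,
}
for threshold, clusters in hierarchy("ABCDX", evalues, [1e-80, 1e-30, 1e-5]):
    print(threshold, clusters)
```

In this toy data, protein D is linked to the A/B/C cluster through a single pair, so the restrictive rule keeps it separate; setting `min_link_fraction=0.0` reduces the procedure to plain transitive closure and lets D join.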


Subjects
Proteins/classification; Proteins/genetics; Algorithms; Artificial Intelligence; Cluster Analysis; Databases, Factual; Proteins/chemistry; Sequence Alignment; Sequence Homology, Amino Acid
7.
J Mol Biol ; 268(2): 539-56, 1997 May 02.
Article in English | MEDLINE | ID: mdl-9159489

ABSTRACT

A global classification of all currently known protein sequences is performed. Every protein sequence is partitioned into segments of 50 amino acid residues and a dynamic programming distance is calculated between each pair of segments. This space of segments is initially embedded into Euclidean space. The algorithm that we apply embeds every finite metric space into Euclidean space so that (1) the dimension of the host space is small, and (2) the metric distortion is small. A novel self-organized, cross-validated clustering algorithm is then applied to the embedded space with Euclidean distances. We monitor the validity of our clustering by randomly splitting the data into two parts and performing a hierarchical clustering algorithm independently on each part. At every level of the hierarchy we cross-validate the clusters in one part with the clusters in the other. The resulting hierarchical tree of clusters offers a new representation of protein sequences and families, which compares favorably with the most up-to-date classifications based on functional and structural data about proteins. Some of the known families cluster into clearly distinct clusters. Motifs and domains such as the zinc finger, EF hand, homeobox, EGF-like and others are automatically and correctly identified, and relations between protein families are revealed by examining the splits along the tree. This clustering leads to a novel representation of protein families, from which functional biological kinship of protein families can be deduced, as demonstrated for the transporter family. Finally, we introduce a new concise representation for complete proteins that is very useful in presenting multiple alignments, and in searching for close relatives in the database. The self-organization method presented is very general and applies to any data with a consistent and computable measure of similarity between data items.
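The first step of this pipeline, cutting sequences into 50-residue segments and computing a dynamic programming distance between every pair, can be sketched as below. The abstract does not say whether the segments overlap or which distance is used, so this sketch assumes non-overlapping windows (dropping any short remainder) and substitutes plain edit distance for the unspecified measure. The function names are illustrative only.

```python
def split_into_segments(sequence, length=50):
    """Cut a sequence into non-overlapping windows of `length` residues.

    Assumption: a trailing remainder shorter than `length` is dropped.
    """
    return [
        sequence[i:i + length]
        for i in range(0, len(sequence) - length + 1, length)
    ]


def edit_distance(a, b):
    """Levenshtein distance by dynamic programming (a stand-in metric)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def distance_matrix(segments):
    """Symmetric matrix of pairwise distances between segments."""
    n = len(segments)
    dist = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = edit_distance(segments[i], segments[j])
    return dist


# Made-up 120-residue sequence: two full segments, remainder dropped.
toy = "ACDEFGHIKL" * 12
segments = split_into_segments(toy)
print(len(segments), distance_matrix(segments))
```

Because edit distance satisfies the metric axioms, the resulting matrix describes a finite metric space, which is the kind of input the embedding step described in the abstract requires.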


Subjects
Cluster Analysis; Proteins/classification; Sequence Analysis/methods; Amino Acid Sequence; Animals; DNA-Binding Proteins/classification; Hemeproteins/classification; Homeodomain Proteins/classification; Humans; Metalloproteins/classification; Molecular Sequence Data; Zinc Fingers