Results 1 - 6 of 6
1.
Interdiscip Sci ; 1(1): 40-5, 2009 Mar.
Article in English | MEDLINE | ID: mdl-20640817

ABSTRACT

Protein function prediction is an important problem in the post-genomic era. When protein function is deduced from protein interaction data, traditional methods treat every interaction sample equally and seldom take the quality of the samples into account. In this paper, we investigate the effect of the quality of protein-protein interaction data on protein function prediction. We also propose two improved methods, the weight neighbour counting method (WNC) and the weight chi-square method (WCHI), which extend the neighbour counting method (NC) and the chi-square method (CHI) by taking the quality of the interaction samples into account. Experimental results show that the quality of the interaction samples strongly affects the performance of protein function prediction methods. They also demonstrate that WNC and WCHI outperform NC and CHI in protein function prediction when the example weights are chosen properly.
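
The difference between plain and weighted neighbour counting can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the example weights are derived, so the per-interaction weights, the protein names, and the function `neighbour_scores` are all assumptions made for the example.

```python
from collections import defaultdict


def neighbour_scores(protein, interactions, annotations, weighted=True):
    """Score candidate functions for `protein` from its interaction partners.

    interactions: iterable of (protein_a, protein_b, weight) tuples, where
                  weight is a hypothetical reliability score for the interaction
    annotations:  dict mapping protein -> set of known functions
    """
    scores = defaultdict(float)
    for a, b, w in interactions:
        # Skip interactions that do not involve the query, and self-interactions.
        if protein not in (a, b) or a == b:
            continue
        partner = b if a == protein else a
        for func in annotations.get(partner, ()):
            # NC adds 1 per annotated neighbour; the weighted variant adds
            # the interaction weight instead.
            scores[func] += w if weighted else 1.0
    return dict(scores)


# Toy data (hypothetical): one reliable and two unreliable interactions for P1.
interactions = [
    ("P1", "P2", 0.9),
    ("P1", "P3", 0.2),
    ("P1", "P4", 0.3),
    ("P2", "P3", 0.8),
]
annotations = {"P2": {"transport"}, "P3": {"metabolism"}, "P4": {"metabolism"}}

nc = neighbour_scores("P1", interactions, annotations, weighted=False)
wnc = neighbour_scores("P1", interactions, annotations, weighted=True)
```

In this toy case the unweighted count favours "metabolism" (two neighbours), while the weighted score favours "transport", because the single interaction supporting it carries more weight than the two low-quality ones combined.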


Subjects
Protein Databases/standards , Protein Interaction Mapping , Saccharomyces cerevisiae Proteins/metabolism , Saccharomyces cerevisiae/metabolism , Chi-Square Distribution , ROC Curve
2.
Interdiscip Sci ; 1(1): 72-80, 2009 Mar.
Article in English | MEDLINE | ID: mdl-20640821

ABSTRACT

It is commonly assumed that genes with similar expression profiles are functionally related, and there are many ways to measure the similarity of gene expression data. With the growing availability of biological information, new combined measures have been constructed by combining that information with different similarity measures. However, it is not clear which measure is the most suitable and effective for gene expression data, or which measure yields the most effective combined measure. In this paper, several similarity measures are analyzed and two new similarity measures are proposed. Their corresponding combined measures are also constructed by incorporating Gene Ontology annotations. All of these measures are evaluated by their effectiveness in detecting functional links in several different gene expression datasets, using a protein-protein interaction database as the reference. The results show that the newly proposed measures and their corresponding combined measures are very effective and suitable for different datasets. Our methodology can also be applied to evaluate new similarity measures and to identify the best measure for a given dataset.


Subjects
Genetic Databases , Evaluation Studies as Topic , Fungal Gene Expression Regulation , Saccharomyces cerevisiae/genetics , Reproducibility of Results , Saccharomyces cerevisiae Proteins/genetics , Saccharomyces cerevisiae Proteins/metabolism
3.
Protein Eng Des Sel ; 19(11): 511-6, 2006 Nov.
Article in English | MEDLINE | ID: mdl-17032692

ABSTRACT

G-protein coupled receptors (GPCRs) are transmembrane proteins that, via G-proteins, initiate some of the important signaling pathways in a cell and are involved in various physiological processes. Computational prediction and classification of GPCRs can therefore supply significant information for the development of novel drugs in the pharmaceutical industry. In this paper, a nearest neighbor method is introduced to discriminate GPCRs from non-GPCRs and subsequently to classify GPCRs at four levels on the basis of the amino acid composition and dipeptide composition of proteins. Its performance is evaluated with the jackknife test on a non-redundant dataset consisting of 1406 GPCRs from six families and 1406 globular proteins. Based on amino acid composition, the method achieved an overall accuracy of 96.4% and a Matthews correlation coefficient (MCC) of 0.930 in distinguishing GPCRs from globular proteins. The overall accuracy and MCC were further improved to 99.8% and 0.996 by the dipeptide composition-based method. The method also classified the 1406 GPCRs into six families with an overall accuracy of 89.6% and 98.8% using amino acid composition and dipeptide composition, respectively. For the subfamily prediction of 1181 GPCRs of the rhodopsin-like family, the method achieved an overall accuracy of 76.7% and 94.5% based on amino acid composition and dipeptide composition, respectively. Finally, GPCRs belonging to the amine subfamily and the olfactory subfamily of the rhodopsin-like family were further analyzed at the type level. The overall accuracy of the dipeptide composition-based method for the classification of amine-type and olfactory-type GPCRs reached 94.5% and 86.9%, respectively, while the overall accuracy of the amino acid composition-based method was very low for both subfamilies. The present method is also highly competitive with existing methods in the literature. These results demonstrate the effectiveness of our method in identifying and classifying GPCRs correctly. GPCRsIdentifier, a corresponding stand-alone executable program for GPCR identification and classification, was also developed and can be obtained freely on request from the authors for academic purposes.
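
The two sequence encodings and the MCC mentioned above can be sketched in a few lines. This is an illustrative sketch only and not the GPCRsIdentifier code: the function names are hypothetical, and the handling of non-standard residues (they are simply ignored) is an assumption, since the abstract does not describe it.

```python
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def amino_acid_composition(seq):
    """Return a 20-D vector with the fraction of each standard residue."""
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    n = len(residues) or 1  # avoid division by zero on empty input
    return [residues.count(a) / n for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return a 400-D vector with the fraction of each overlapping residue pair."""
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    valid = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # assumption: pairs with non-standard residues are skipped
            counts[pair] += 1
            valid += 1
    total = valid or 1
    return [counts[d] / total for d in DIPEPTIDES]


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


aac = amino_acid_composition("ACCA")
dpc = dipeptide_composition("ACCA")
```

The dipeptide vector keeps local order information (which residue follows which) that the 20-D composition discards, which is one plausible reason for the higher accuracies reported with it.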


Subjects
G-Protein-Coupled Receptors/classification , Algorithms , Amino Acids/analysis , Biometry , Protein Databases , Dipeptides/chemistry , Protein Engineering , G-Protein-Coupled Receptors/chemistry , G-Protein-Coupled Receptors/genetics
4.
Yi Chuan ; 28(3): 329-33, 2006 Mar.
Article in Chinese | MEDLINE | ID: mdl-16551601

ABSTRACT

The NCBI Reference Sequence (RefSeq) database aims to provide a biologically non-redundant collection of DNA, RNA, and protein sequences and to promote research on the genes and proteins of humans and other species. However, because of widely distributed polymorphisms and differences in experimental quality control among individual laboratories, the RefSeq database contains potential problems that need to be identified. We therefore define the concept of a standard transcript, based on the Central Dogma of Biology: each standard transcript should map perfectly to the standard genomic DNA sequence at the exon level. A large-scale analysis mapping all human RefSeq records (2005-4-18) to the officially released human genome sequence database (2005-4-20) was then performed using BLAT, Sim4 and an in-house program, EIparser, which was designed especially for this purpose. The standard transcripts based on the RefSeq database were obtained according to their alignment with the standard human genome database. A total of 9,771 human RefSeq records labeled with "NM_" and "NR_" could be mapped perfectly to human genome sequences, while another 10,943 records could be considered standard transcripts after reasonable revision by comparison with the genome sequences according to all three methods. The remaining 203 unrevisable records and 2,676 inconsistent records reported by the above programs could not be considered standard transcripts and should be checked critically before use because of potential errors in them. Our study thus provides a high-quality human reference standard dataset for further bioinformatic and experimental analysis, such as studies of polymorphism and mutation in human genes. The reference standard dataset based on the above criteria can be retrieved from http://biocompute.bmi.ac.cn/transcriptome/index.htm.
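
The exon-level criterion can be illustrated with a toy check. This is only a sketch of the idea and is not EIparser, BLAT or Sim4: the function name, the 0-based half-open exon coordinates and the forward-strand assumption are all hypothetical choices made for the example.

```python
def is_standard_transcript(transcript, genome, exons):
    """Return True if the transcript equals the concatenated genomic exons.

    transcript: transcript sequence as a string
    genome:     genomic sequence as a string (forward strand assumed)
    exons:      list of (start, end) pairs, 0-based and half-open,
                in transcript order
    """
    spliced = "".join(genome[start:end] for start, end in exons)
    # "Perfectly mapped" is taken here to mean an exact, gap-free match.
    return spliced.upper() == transcript.upper()


# Toy genome with two exons separated by a three-base intron (hypothetical data).
genome = "AAACCCGGGTTTAAA"
exons = [(0, 3), (6, 9)]

perfect = is_standard_transcript("AAAGGG", genome, exons)
mismatch = is_standard_transcript("AAAGGC", genome, exons)
```

A record like the second one, which differs from the genome at a single base, would fall among the records that need revision or critical checking rather than among the perfectly mapped ones.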


Subjects
Genetic Databases , Nucleic Acid Databases , Human Genome/genetics , Humans
5.
Comput Biol Chem ; 29(5): 388-92, 2005 Oct.
Article in English | MEDLINE | ID: mdl-16213794

ABSTRACT

The subcellular location of a protein is closely correlated with its biological function. In this paper, two new pattern classification methods, termed Nearest Feature Line (NFL) and Tunable Nearest Neighbor (TNN), are introduced to predict the subcellular location of proteins based on their amino acid composition alone. The simulation experiments were performed with the jackknife test on a previously constructed data set, which consists of 2,427 eukaryotic and 997 prokaryotic proteins. All protein sequences in the data set fall into four eukaryotic subcellular locations and three prokaryotic subcellular locations. The NFL classifier reached total prediction accuracies of 82.5% for the eukaryotic proteins and 91.0% for the prokaryotic proteins. The TNN classifier reached total prediction accuracies of 83.6% and 92.2%, respectively. Compared with Support Vector Machine (SVM) and Nearest Neighbor methods, these two methods display similar or even higher prediction accuracies. Hence, we conclude that NFL and TNN can be used as complementary methods for the prediction of protein subcellular locations.
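
The NFL idea can be sketched as follows. In the usual formulation of nearest feature line classification, the query is compared not with individual prototypes but with the straight line through each pair of prototypes of the same class. The code below is a minimal sketch under that assumption, not the authors' implementation; the function names and the two-dimensional toy vectors are hypothetical, and TNN is not covered.

```python
from itertools import combinations
from math import sqrt


def feature_line_distance(x, a, b):
    """Euclidean distance from point x to the line through prototypes a and b."""
    direction = [bi - ai for ai, bi in zip(a, b)]
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0:
        mu = 0.0  # identical prototypes: the line degenerates to a point
    else:
        mu = sum((xi - ai) * d for xi, ai, d in zip(x, a, direction)) / norm_sq
    projection = [ai + mu * d for ai, d in zip(a, direction)]
    return sqrt(sum((xi - pi) ** 2 for xi, pi in zip(x, projection)))


def nfl_classify(x, prototypes):
    """Return (label, distance) of the nearest feature line.

    prototypes: dict mapping class label -> list of feature vectors
                (at least two vectors per class are needed to form a line)
    """
    best_label, best_dist = None, float("inf")
    for label, vectors in prototypes.items():
        for a, b in combinations(vectors, 2):
            dist = feature_line_distance(x, a, b)
            if dist < best_dist:
                best_label, best_dist = label, dist
    return best_label, best_dist


# Toy 2-D prototypes (hypothetical); real inputs would be 20-D compositions.
prototypes = {
    "cytoplasmic": [[0.0, 0.0], [1.0, 0.0]],
    "nuclear": [[0.0, 2.0], [1.0, 3.0]],
}
label, dist = nfl_classify([5.0, 0.1], prototypes)
```

Because the line extends beyond the two prototypes, a query far from every training point can still lie close to a feature line, which is how the method extrapolates from a small number of prototypes per class.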


Subjects
Algorithms , Computational Biology/methods , Proteins/chemistry , Protein Databases , Proteins/analysis , Software
6.
FEBS Lett ; 579(16): 3444-8, 2005 Jun 20.
Article in English | MEDLINE | ID: mdl-15949806

ABSTRACT

To understand the structure and function of a protein, an important task is to know where it occurs in the cell. A computational method for accurately predicting the subcellular location of proteins would therefore be valuable for interpreting the raw data produced by large-scale genome sequencing projects. The present work explores an effective method for extracting features from the protein primary sequence and seeks a novel similarity measure among proteins for assigning a protein to its proper subcellular location. We considered four locations in eukaryotic cells and three locations in prokaryotic cells, which have been investigated by several groups in the past. A combined primary-sequence feature, defined as a 430D (dimensional) vector, was used to represent a protein; it includes 20 amino acid compositions, 400 dipeptide compositions and 10 physicochemical properties. To evaluate the prediction performance of this encoding scheme, a jackknife test based on the nearest neighbor algorithm was employed. The prediction accuracies for cytoplasmic, extracellular, mitochondrial, and nuclear proteins in the former (eukaryotic) dataset were 86.3%, 89.2%, 73.5% and 89.4%, respectively, and the total prediction accuracy reached 86.3%. For cytoplasmic, extracellular, and periplasmic proteins in the latter (prokaryotic) dataset, the prediction accuracies were 97.4%, 86.0%, and 79.7%, respectively, and the total prediction accuracy reached 92.5%. The results indicate that this method outperforms some existing approaches based on amino acid composition alone or on amino acid composition combined with dipeptide composition.
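
The evaluation protocol (a jackknife test around a nearest neighbor classifier, reported per location and in total) can be sketched as below. This is a minimal illustration under stated assumptions: plain Euclidean distance stands in for the similarity measure, which the abstract does not specify, and the function name, labels and two-dimensional toy vectors are hypothetical stand-ins for the 430-D encoding.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def jackknife_nearest_neighbour(samples):
    """Leave-one-out evaluation of a 1-nearest-neighbour classifier.

    samples: list of (feature_vector, label) pairs, at least two entries
    Returns (per_class_accuracy, overall_accuracy).
    """
    correct, total = {}, {}
    for i, (x, label) in enumerate(samples):
        # Hold out sample i and find its nearest neighbour among the rest.
        nearest = min(
            (s for j, s in enumerate(samples) if j != i),
            key=lambda s: dist(x, s[0]),
        )
        total[label] = total.get(label, 0) + 1
        correct[label] = correct.get(label, 0) + (nearest[1] == label)
    per_class = {c: correct[c] / total[c] for c in total}
    overall = sum(correct.values()) / len(samples)
    return per_class, overall


# Toy data (hypothetical): the last "nuclear" sample sits among "cytoplasmic" ones.
samples = [
    ([0.0, 0.0], "cytoplasmic"),
    ([0.1, 0.0], "cytoplasmic"),
    ([1.0, 1.0], "nuclear"),
    ([1.1, 1.0], "nuclear"),
    ([0.2, 0.1], "nuclear"),
]
per_class, overall = jackknife_nearest_neighbour(samples)
```

The per-class figures correspond to the accuracies quoted for each location, and the overall figure to the total prediction accuracy; the two can differ noticeably when classes are unequal in size.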


Subjects
Computational Biology/methods , Intracellular Space/chemistry , Proteins/analysis , Protein Sequence Analysis/methods , Amino Acid Sequence , Eukaryotic Cells/metabolism , Intracellular Space/metabolism , Prokaryotic Cells/metabolism , Proteins/metabolism