Results 1 - 20 of 39
1.
Article in English | MEDLINE | ID: mdl-32750873

ABSTRACT

The identification of essential proteins is an important problem in bioinformatics. During the past decades, many centrality measures and algorithms have been proposed to address this issue. However, existing methods still suffer from the following drawbacks: (1) the lack of a context-free and readily interpretable quantification of their centrality values; (2) the difficulty of specifying a proper threshold for their centrality values; (3) the inability to control the quality of reported essential proteins in a statistically sound manner. To overcome the limitations of existing solutions, we tackle the essential protein discovery problem from a significance testing perspective. More precisely, the essential protein discovery problem is formulated as a multiple hypothesis testing problem, where the null hypothesis for each protein is that it is not an essential protein. To quantify the statistical significance of each protein, we present a p-value calculation method in which both the degree and the local clustering coefficient are used as the test statistic and the Erdős-Rényi model is employed as the random graph model. After calculating the p-value for each protein, the false discovery rate is used as the error rate in the multiple testing correction procedure. Our significance-based essential protein discovery method is named SigEP, and it is tested on both simulated networks and real PPI networks. The experimental results show that our method is able to achieve better performance than the competing algorithms.
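
The significance-testing idea in this abstract can be illustrated with a small sketch. It is not the SigEP implementation: it assumes the node degree alone as the test statistic (the abstract also uses the local clustering coefficient), treats the degree under the Erdős-Rényi model as Binomial(n - 1, p) with p estimated from the edge density, and applies the Benjamini-Hochberg procedure as one common way to control the false discovery rate. The function names and the toy graph are hypothetical.

```python
from math import comb


def degree_p_value(d, n, p):
    """Upper-tail probability P(X >= d) for X ~ Binomial(n - 1, p).

    Assumption: under an Erdos-Renyi null with edge probability p,
    the degree of a node in an n-node graph is Binomial(n - 1, p).
    """
    return sum(comb(n - 1, k) * p**k * (1 - p) ** (n - 1 - k) for k in range(d, n))


def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff = rank  # largest rank that passes its threshold
    return sorted(order[:cutoff])


# Toy network (hypothetical): node 0 is a hub, the remaining nodes are sparse.
n = 10
edges = [(0, i) for i in range(1, n)] + [(1, 2)]
degree = [0] * n
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

p_edge = len(edges) / comb(n, 2)  # edge density as the E-R parameter
p_values = [degree_p_value(d, n, p_edge) for d in degree]
significant = benjamini_hochberg(p_values, alpha=0.05)
print(significant)
```

On this toy graph only the hub has a degree that is unlikely under the null model, so it is the only node reported at a 5% false discovery rate.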


Subjects
Protein Interaction Maps , Proteins , Algorithms , Computational Biology , Proteins/genetics
2.
Sci Rep ; 11(1): 20304, 2021 10 13.
Article in English | MEDLINE | ID: mdl-34645850

ABSTRACT

Community detection is a fundamental procedure in the analysis of network data. Despite decades of research, there is still no consensus on the definition of a community. To analytically test the realness of a candidate community in weighted networks, we present a general formulation from a significance testing perspective. In this new formulation, the edge-weight is modeled as a censored observation due to the noisy characteristics of real networks. In particular, the edge-weights of missing links are incorporated as well, which are specified to be zeros based on the assumption that they are truncated or unobserved. Thereafter, the community significance assessment issue is formulated as a two-sample test problem on censored data. More precisely, the Logrank test is employed to conduct the significance testing on two sets of augmented edge-weights: internal weight set and external weight set. The presented approach is evaluated on both weighted networks and un-weighted networks. The experimental results show that our method can outperform prior widely used evaluation metrics on the task of individual community validation.

3.
Patterns (N Y) ; 2(9): 100321, 2021 Sep 10.
Article in English | MEDLINE | ID: mdl-34553168

ABSTRACT

Influential node identification plays a significant role in understanding network structure and functions. Here we propose a general method for detecting influential nodes in a graph-traversal framework. We evaluate the influence of each node by constructing a breadth-first search (BFS) tree in which the target node is the root node. From the BFS tree, we generate a curve in which the x axis is the level number and the y axis is the cumulative scores of all nodes visited so far. We use the area under the curve value as the final influence score of the target node. Experimental results on various networks across different domains demonstrate that our method can be significantly superior to widely used centrality measures on the task of influential node detection.
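
The BFS-curve construction in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the per-node score is taken to be the degree, the area is computed with the trapezoid rule, and every curve is padded with its final value up to level n - 1 so that nodes with BFS trees of different depth are compared on the same x range. None of these choices is confirmed by the abstract, and the function names and graph are hypothetical.

```python
from collections import deque


def bfs_levels(adj, root):
    """Group the nodes reachable from root by their BFS depth."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in depth:
                depth[v] = depth[u] + 1
                queue.append(v)
    levels = [[] for _ in range(max(depth.values()) + 1)]
    for node, d in depth.items():
        levels[d].append(node)
    return levels


def influence_score(adj, root, max_level=None):
    """Area under the cumulative-score curve of the BFS tree rooted at root."""
    if max_level is None:
        max_level = len(adj) - 1  # longest possible shortest path
    cumulative, total = [], 0
    for level in bfs_levels(adj, root):
        total += sum(len(adj[v]) for v in level)  # assumed node score: degree
        cumulative.append(total)
    # Pad the curve so that all roots share the same x range (assumption).
    cumulative += [total] * (max_level + 1 - len(cumulative))
    return sum((cumulative[i] + cumulative[i + 1]) / 2 for i in range(max_level))


# Toy graph (hypothetical): "a" is central, "e" is peripheral.
adj = {
    "a": ["b", "c", "d"],
    "b": ["a"],
    "c": ["a"],
    "d": ["a", "e"],
    "e": ["d"],
}
scores = {node: influence_score(adj, node) for node in adj}
print(scores)
```

A root that reaches high-scoring nodes within few levels accumulates area early, so the central node "a" receives a larger score than the peripheral node "e".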

4.
IEEE/ACM Trans Comput Biol Bioinform ; 18(6): 2788-2794, 2021.
Article in English | MEDLINE | ID: mdl-34347602

ABSTRACT

Essential proteins play a vital role in understanding cellular life. With the advance in high-throughput technologies, a number of protein-protein interaction (PPI) networks have been constructed such that essential proteins can be identified from a systems biology perspective. Although a series of network-based essential protein discovery methods have been proposed, these existing methods still have some drawbacks. Recently, it has been shown that the significance-based method SigEP is promising in overcoming the defects that are inherent in currently available essential protein identification methods. However, the SigEP method is developed under the unrealistic Erdős-Rényi (E-R) model and its time complexity is very high. Hence, we propose a new significance-based essential protein recognition method named EPCS in which the essential protein discovery problem is formulated as a community significance testing problem. Experimental results on four PPI networks show that EPCS performs better than nine state-of-the-art essential protein identification methods and the only other significance-based essential protein identification method, SigEP.


Subjects
Protein Interaction Mapping/methods , Protein Interaction Maps/genetics , Proteins/genetics , Genes, Essential/genetics
5.
IEEE/ACM Trans Comput Biol Bioinform ; 17(6): 2062-2073, 2020.
Article in English | MEDLINE | ID: mdl-31027047

ABSTRACT

The detection of protein complexes from protein-protein interaction networks is a fundamental issue in bioinformatics and systems biology. To solve this problem, numerous methods have been proposed from different angles in the past decades. However, the detection of statistically significant protein complexes has still not received much attention. Although there are a few methods available in the literature for identifying statistically significant protein complexes, none of these methods can provide stricter control over the error rate of a protein complex in terms of family-wise error rate (FWER). In this paper, we propose a new detection method SSF that is capable of controlling the FWER of each reported protein complex. More precisely, we first present a p-value calculation method based on Fisher's exact test to quantify the association between each protein and a given candidate protein complex. Subsequently, we describe the key modules of the SSF algorithm: a seed expansion procedure for significant protein complexes search and a set cover strategy for redundancy elimination. The experimental results on five benchmark data sets show that: (1) our method can achieve the highest precision; (2) it outperforms three competing methods in terms of normalized mutual information (NMI) and F1 score in most cases.
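
A Fisher's exact test of the association between one protein and a candidate complex can be sketched with the hypergeometric upper tail. The contingency setup below is an assumption made for illustration (the abstract does not spell it out): of the `population` other proteins in the network, `complex_size` belong to the candidate complex, the protein has `degree` neighbours in total, and `x` of those neighbours lie inside the complex. The function name is hypothetical.

```python
from math import comb


def fisher_upper_tail(x, complex_size, degree, population):
    """One-sided Fisher p-value: P(X >= x) for X ~ Hypergeometric.

    X counts how many of the protein's `degree` neighbours fall inside a
    complex of `complex_size` proteins when neighbours are drawn at random
    from `population` proteins.
    """
    total = comb(population, degree)
    upper = min(complex_size, degree)
    tail = sum(
        comb(complex_size, i) * comb(population - complex_size, degree - i)
        for i in range(x, upper + 1)
    )
    return tail / total


# Hypothetical example: all 4 neighbours of a protein lie in a 5-protein complex.
p = fisher_upper_tail(x=4, complex_size=5, degree=4, population=20)
print(p)
```

A small p-value indicates that the protein has more neighbours inside the complex than random wiring would explain. Controlling the family-wise error rate over many such tests would additionally need a correction such as Bonferroni, which is not shown here.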


Subjects
Algorithms , Computational Biology/methods , Protein Interaction Mapping/methods , High-Throughput Nucleotide Sequencing , Protein Interaction Maps/genetics , Proteins/analysis , Proteins/chemistry , Proteins/genetics , Proteins/metabolism , Systems Biology
6.
Article in English | MEDLINE | ID: mdl-28534782

ABSTRACT

Affinity Purification-Mass Spectrometry (AP-MS) is one of the most important technologies for constructing protein-protein interaction (PPI) networks. In this paper, we propose an ensemble method, Reinforce, for inferring PPI networks from AP-MS data sets. Reinforce is based on rank aggregation and false discovery rate control. Under the null hypothesis that the interaction scores from different scoring methods are randomly generated, Reinforce follows three steps to integrate multiple ranking results from different algorithms or different data sets. The experimental results show that Reinforce can get more stable and accurate inference results than existing algorithms. The source code of Reinforce and the data sets used in the experiments are available at: https://sourceforge.net/projects/reinforce/.


Subjects
Computational Biology/methods , Mass Spectrometry/methods , Protein Interaction Mapping/methods , Algorithms , Computer Simulation , Databases, Protein , Protein Interaction Maps
7.
BMC Bioinformatics ; 19(1): 535, 2018 Dec 20.
Article in English | MEDLINE | ID: mdl-30572820

ABSTRACT

BACKGROUND: Identifying protein complexes from protein-protein interaction (PPI) networks is one of the most important tasks in proteomics. Existing computational methods try to incorporate a variety of biological evidence to enhance the quality of predicted complexes. However, it is still a challenge to integrate different types of biological information into the complex discovery process under a unified framework. Recently, attributed network embedding methods have been proved to be remarkably effective in generating vector representations for nodes in the network. In the transformed vector space, both the topological proximity and the node attribute affinity between different nodes are preserved. Therefore, such attributed network embedding methods provide us with a unified framework to integrate various biological evidence into the protein complex identification process. RESULTS: In this article, we propose a new method called GANE to predict protein complexes based on Gene Ontology (GO) attributed network embedding. Firstly, it learns the vector representation for each protein from a GO attributed PPI network. Based on the pair-wise vector representation similarity, a weighted adjacency matrix is constructed. Secondly, it uses the clique mining method to generate candidate cores. Subsequently, seed cores are obtained by ranking candidate cores based on their densities on the weighted adjacency matrix and removing redundant cores. For each seed core, its attachments are the proteins whose correlation score is larger than a given threshold. The combination of a seed core and its attachment proteins is reported as a predicted protein complex by the GANE algorithm. For performance evaluation, we compared GANE with six protein complex identification methods on five yeast PPI networks. Experimental results show that GANE performs better than the competing algorithms in terms of different evaluation metrics.
CONCLUSIONS: GANE provides a framework that integrates diverse and valuable biological information into the task of protein complex identification. The protein vector representation learned from our attributed PPI network can also be used in other tasks, such as PPI prediction and disease gene prediction.
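
Two steps of the pipeline in this abstract, building edge weights from vector similarity and selecting attachment proteins for a core, can be sketched as below. The sketch is not the GANE code: cosine similarity is assumed as the similarity measure, the attachment score is assumed to be the mean edge weight between a candidate and the core members, and the embedding vectors, protein names, and function names are hypothetical.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def weighted_adjacency(edges, embedding):
    """Weight every PPI edge by the similarity of its endpoint vectors."""
    weights = {}
    for u, v in edges:
        w = cosine_similarity(embedding[u], embedding[v])
        weights[(u, v)] = w
        weights[(v, u)] = w
    return weights


def attachments(core, candidates, weights, threshold):
    """Candidates whose mean weight to the core exceeds the threshold.

    The mean-weight score is an assumption made for this sketch.
    """
    selected = []
    for c in candidates:
        score = sum(weights.get((c, m), 0.0) for m in core) / len(core)
        if score > threshold:
            selected.append(c)
    return selected


# Hypothetical two-dimensional embeddings and PPI edges.
embedding = {"P1": [1.0, 0.0], "P2": [1.0, 0.0], "P3": [0.0, 1.0], "P4": [3.0, 4.0]}
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P4")]
weights = weighted_adjacency(edges, embedding)
predicted_attachments = attachments(["P1", "P2"], ["P3", "P4"], weights, threshold=0.5)
print(predicted_attachments)
```

Pairs without an edge in the PPI network contribute a weight of zero, so a candidate needs both interactions with the core and similar vectors to be attached.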


Subjects
Protein Interaction Mapping/methods , Proteins/metabolism , Proteomics/methods , Humans
8.
Front Genet ; 9: 272, 2018.
Article in English | MEDLINE | ID: mdl-30087694

ABSTRACT

Identifying protein complexes from protein-protein interaction networks (PPINs) is important to understand the science of cellular organization and function. However, PPINs produced by high-throughput studies have a high false discovery rate and only represent snapshot interaction information. Reconstructing higher quality PPINs is essential for protein complex identification. Here we present a Multi-Level PPINs reconstruction (MLPR) method for protein complex detection. From existing PPINs, we generated full combinations of every two proteins. Each protein pair is represented as a vector that includes six different sources. Then the protein pairs with the same vector are mapped to the same fingerprint ID. A fingerprint similarity network is constructed next, in which a vertex represents a protein pair fingerprint ID and each vertex is connected to its top 10 similar fingerprints by edges. After random walking on the fingerprint similarity network, each vertex received a score at the steady state. According to the score of protein pairs, we considered the top ranked ones as reliable PPIs and the score as the weight of the edge between two distinct proteins. Finally, we expanded clusters starting from seed vertices based on the new weighted reliable PPINs. Applied to the yeast PPINs, our algorithm achieved a higher F-value in protein complex detection than the state-of-the-art methods. The interactions in our reconstructed PPI network have more significant biological relevance than the existing PPI datasets, as assessed by gene ontology. In addition, the performance of existing popular protein complex detection methods is significantly improved on our reconstructed network.

9.
BMC Syst Biol ; 11(Suppl 4): 82, 2017 Sep 21.
Article in English | MEDLINE | ID: mdl-28950876

ABSTRACT

BACKGROUND: Affinity purification-mass spectrometry (AP-MS) has been widely used for generating bait-prey data sets so as to identify underlying protein-protein interactions and protein complexes. However, the AP-MS data sets in terms of bait-prey pairs are highly noisy, where candidate pairs contain many false positives. Recently, numerous computational methods have been developed to identify genuine interactions from AP-MS data sets. However, most of these methods aim at removing false positives that contain contaminants, ignoring the distinction between direct interactions and indirect interactions. RESULTS: In this paper, we present an initialization-and-refinement framework for inferring direct PPI networks from AP-MS data, in which an initial network is first generated with existing scoring methods and then a refined network is constructed by the application of indirect association removal methods. Experimental results on several real AP-MS data sets show that our method is capable of identifying more direct interactions than traditional scoring methods. CONCLUSIONS: The proposed framework is sufficiently general to incorporate any feasible method in each step, so it has the potential to handle different types of AP-MS data in future applications.


Subjects
Chromatography, Affinity , Computational Biology/methods , Mass Spectrometry , Protein Interaction Mapping
10.
Article in English | MEDLINE | ID: mdl-27076459

ABSTRACT

Since the discovery of the regulatory function of microRNA (miRNA), increased attention has focused on identifying the relationship between miRNA and disease. It has been suggested that computational methods are an efficient way to identify potential disease-related miRNAs for further confirmation using biological experiments. In this paper, we first highlighted three limitations commonly associated with previous computational methods. To resolve these limitations, we established a disease similarity subnetwork and a miRNA similarity subnetwork by integrating multiple data sources, where the disease similarity is composed of disease semantic similarity and disease functional similarity, and the miRNA similarity is calculated using the miRNA-target gene and miRNA-lncRNA (long non-coding RNA) associations. Then, a heterogeneous network was constructed by connecting the disease similarity subnetwork and the miRNA similarity subnetwork using the known miRNA-disease associations. We extended random walk with restart to predict miRNA-disease associations in the heterogeneous network. The leave-one-out cross-validation achieved an average area under the curve (AUC) of 0.8049 across 341 diseases and 476 miRNAs. For five-fold cross-validation, our method achieved an AUC from 0.7970 to 0.9249 for 15 human diseases. Case studies further demonstrated the feasibility of our method to discover potential miRNA-disease associations. An online service for prediction is freely available at http://ifmda.aliapp.com.
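
Random walk with restart, the building block this abstract extends, can be sketched on a plain unweighted graph. The sketch does not reproduce the heterogeneous miRNA-disease network or the extension described by the authors: it assumes uniform transition probabilities over neighbours, a restart probability of 0.3 chosen arbitrarily, and that every node has at least one neighbour. The function name and toy graph are hypothetical.

```python
def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Steady-state visiting probabilities of a walker that restarts at the seeds.

    Each step: with probability `restart` jump back to a seed node, otherwise
    move to a uniformly chosen neighbour. Assumes no isolated nodes.
    """
    nodes = sorted(adj)
    p0 = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {v: restart * p0[v] for v in nodes}
        for u in nodes:
            share = (1.0 - restart) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        delta = sum(abs(nxt[v] - p[v]) for v in nodes)
        p = nxt
        if delta < tol:  # L1 change small enough: treat as converged
            break
    return p


# Toy path graph a - b - c - d with the walk seeded at "a" (hypothetical data).
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
rwr_scores = random_walk_with_restart(adj, seeds={"a"})
print(sorted(rwr_scores, key=rwr_scores.get, reverse=True))
```

Nodes closer to the seed end up with higher steady-state probability, which is what makes the scores usable for ranking candidate associations around known ones.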

11.
Adv Exp Med Biol ; 919: 237-242, 2016.
Article in English | MEDLINE | ID: mdl-27975221

ABSTRACT

Protein inference is one of the most important steps in protein identification, which transforms peptides identified from tandem mass spectra into a list of proteins. In this chapter, we provide a brief introduction on this problem and present a short summary on the existing protein inference methods in the literature.


Subjects
Computational Biology/methods , Data Mining/methods , Databases, Protein , Proteins/analysis , Proteome , Proteomics/methods , Tandem Mass Spectrometry/methods , Algorithms , Animals , High-Throughput Screening Assays , Humans , Reproducibility of Results , Workflow
12.
Comput Biol Chem ; 63: 21-29, 2016 08.
Article in English | MEDLINE | ID: mdl-26935399

ABSTRACT

In mass spectrometry-based shotgun proteomics, protein quantification and protein identification are two major computational problems. To quantify the protein abundance, a list of proteins must first be inferred from the raw data. Then the relative or absolute protein abundance is estimated with quantification methods, such as spectral counting. Until now, most researchers have been dealing with these two processes separately. In fact, the protein inference problem can be regarded as a special protein quantification problem in the sense that truly present proteins are those proteins whose abundance values are not zero. Some recently published papers have conceptually discussed this possibility. However, there is still a lack of rigorous experimental studies to test this hypothesis. In this paper, we investigate the feasibility of using protein quantification methods to solve the protein inference problem. Protein inference methods aim to determine whether each candidate protein is present in the sample or not. Protein quantification methods estimate the abundance value of each inferred protein. Naturally, the abundance value of an absent protein should be zero. Thus, we argue that the protein inference problem can be viewed as a special protein quantification problem in which one protein is considered to be present if its abundance is not zero. Based on this idea, our paper tries to use three simple protein quantification methods to solve the protein inference problem effectively. The experimental results on six data sets show that these three methods are competitive with previous protein inference algorithms. This demonstrates that it is plausible to model the protein inference problem as a special protein quantification task, which opens the door to devising more effective protein inference algorithms from a quantification perspective. The source code of our methods is available at: http://code.google.com/p/protein-inference/.


Subjects
Proteins/chemistry , Algorithms , Animals , Humans , Mass Spectrometry
13.
Comput Biol Chem ; 57: 12-20, 2015 Aug.
Article in English | MEDLINE | ID: mdl-25707552

ABSTRACT

Protein inference from the identified peptides is of primary importance in shotgun proteomics. The goal of protein inference is to identify whether each candidate protein is truly present in the sample. To date, many computational methods have been proposed to solve this problem. However, there is still no method that can fully utilize the information hidden in the input data. In this article, we propose a learning-based method named BagReg for protein inference. The method first artificially extracts five features from the input data, and then chooses each feature in turn as the class feature to separately build models to predict the presence probabilities of proteins. Finally, the weak results from the five prediction models are aggregated to obtain the final result. We test our method on six publicly available data sets. The experimental results show that our method is superior to the state-of-the-art protein inference algorithms.


Subjects
Computational Biology , Machine Learning , Proteins/chemistry , Databases, Protein
14.
Brief Bioinform ; 16(4): 658-74, 2015 Jul.
Article in English | MEDLINE | ID: mdl-25378435

ABSTRACT

Protein-protein interaction is of primary importance to understand protein functions. In recent years, the high-throughput AP-MS experiments have generated a large amount of bait-prey data, posing great challenges on the computational analysis of such data for inferring true interactions and protein complexes. To date, many research efforts have been devoted to developing novel computational methods to analyze these AP-MS data sets. In this article, we review and classify the key computational methods developed for the inference of protein-protein interactions and the detection of protein complexes from the AP-MS experiments. We hope that our review as well as the challenges highlighted in the article will provide valuable insights into driving future research for further advancing the state-of-the-art technologies in computational prediction, characterization and analysis of protein-protein interactions and protein complexes from the AP-MS data.


Subjects
Chromatography, Affinity/methods , Mass Spectrometry/methods , Proteins/chemistry , Protein Binding
15.
Brief Bioinform ; 16(5): 884-900, 2015 Sep.
Article in English | MEDLINE | ID: mdl-25433466

ABSTRACT

Discriminative pattern mining is one of the most important techniques in data mining. This challenging task is concerned with finding a set of patterns that occur with disproportionate frequency in data sets with various class labels. Such patterns are of great value for group difference detection and classifier construction. Research on finding interesting discriminative patterns in class-labeled data evolves rapidly and lots of algorithms have been proposed to specifically address this problem. Discriminative pattern mining techniques have proven their considerable value in biological data analysis. The archetypical applications in bioinformatics include phosphorylation motif discovery, differentially expressed gene identification, discriminative genotype pattern detection, etc. In this article, we present an overview of discriminative pattern mining and the corresponding effective methods, and subsequently we illustrate their applications to tackling the bioinformatics problems. In the end, we give a general discussion of potential challenges and future work for this task.


Subjects
Computational Biology , Data Mining , Algorithms , Models, Theoretical , Software
16.
Brief Bioinform ; 15(5): 839-55, 2014 Sep.
Article in English | MEDLINE | ID: mdl-23543354

ABSTRACT

Protein phosphorylation is one of the most pervasive post-translational modifications, regulating diverse cellular processes in various organisms. As mass spectrometry-based experimental approaches for identifying phosphorylation events are resource-intensive, many computational methods have been proposed, in which phosphorylation site prediction is formulated as a classification problem. They differ in several ways, and one crucial issue is the construction of training data and test data for unbiased performance evaluation. In this article, we categorize the existing data construction methods and try to answer three questions: (i) Is it equivalent to use different data construction methods in the assessment of phosphorylation site prediction algorithms? (ii) What kind of test data set is unbiased for assessing the prediction performance of a trained algorithm in different real world scenarios? (iii) Among the summarized training data construction methods, which one(s) has better generalization performance for most scenarios? To answer these questions, we conduct comprehensive experimental studies for both non-kinase-specific and kinase-specific prediction tasks. The experimental results show that: (i) different data construction methods can lead to significantly different prediction performance; (ii) there can be different test data construction methods that are unbiased with respect to different real world scenarios; and (iii) different data construction methods have different generalization performance in different real world scenarios. Therefore, when developing new algorithms in future research, people should concentrate on what kind of scenario their algorithm will work for, what the corresponding unbiased test data are and which training data construction method can generate best generalization performance.


Subjects
Proteins/metabolism , Algorithms , Phosphorylation
17.
Bioinformatics ; 30(5): 675-81, 2014 Mar 01.
Article in English | MEDLINE | ID: mdl-23926225

ABSTRACT

MOTIVATION: Statistical validation of protein identifications is an important issue in shotgun proteomics. The false discovery rate (FDR) is a powerful statistical tool for evaluating the protein identification result. Several research efforts have been made for FDR estimation at the protein level. However, there are still certain drawbacks in the existing FDR estimation methods based on the target-decoy strategy. RESULTS: In this article, we propose a decoy-free protein-level FDR estimation method. Under the null hypothesis that each candidate protein matches an identified peptide totally at random, we assign statistical significance to protein identifications in terms of the permutation P-value and use these P-values to calculate the FDR. Our method consists of three key steps: (i) generating random bipartite graphs with the same structure; (ii) calculating the protein scores on these random graphs; and (iii) calculating the permutation P-value and final FDR. As it is time-consuming or prohibitive to execute the protein inference algorithms thousands of times in step ii, we first train a linear regression model using the original bipartite graph and identification scores provided by the target inference algorithm. Then we use the learned regression model as a substitute for the original protein inference method to predict protein scores on shuffled graphs. We test our method on six publicly available datasets. The results show that our method is comparable with state-of-the-art algorithms in terms of estimation accuracy. AVAILABILITY: The source code of our algorithm is available at: https://sourceforge.net/projects/plfdr/
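
Step (iii) of this abstract, turning scores from randomized graphs into permutation P-values and then into an FDR, can be sketched as follows. The sketch covers only that last step: the null scores are assumed to be given (generating the shuffled bipartite graphs and the regression surrogate are not shown), the P-value uses the common add-one estimator, and Benjamini-Hochberg adjusted values are used as one standard way to convert P-values into FDR estimates. Whether the authors use exactly these formulas is not stated in the abstract, and the function names are hypothetical.

```python
def permutation_p_value(observed, null_scores):
    """Fraction of null scores at least as large as the observed score.

    The +1 in numerator and denominator keeps the estimate away from zero.
    """
    exceed = sum(1 for s in null_scores if s >= observed)
    return (exceed + 1) / (len(null_scores) + 1)


def bh_q_values(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


# Hypothetical protein score and scores obtained on four shuffled graphs.
p_example = permutation_p_value(5.0, [1.0, 2.0, 3.0, 6.0])
q_example = bh_q_values([0.01, 0.04, 0.03])
print(p_example, q_example)
```

With B randomized graphs the smallest attainable P-value is 1 / (B + 1), which is one reason thousands of randomizations are needed and a fast surrogate for the inference algorithm becomes attractive.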


Subjects
Proteins/chemistry , Proteomics/methods , Algorithms , Humans , Linear Models , Peptides/chemistry
18.
Article in English | MEDLINE | ID: mdl-26355518

ABSTRACT

MicroRNA (miRNA) plays an important role as a regulator in biological processes. Identification of (pre-) miRNAs helps in understanding regulatory processes. Machine learning methods have been designed for pre-miRNA identification. However, most of them cannot provide reliable predictive performances on independent testing data sets. We assumed this is because the training sets, especially the negative training sets, are not sufficiently representative. To generate a representative negative set, we proposed a novel negative sample selection technique, and successfully collected negative samples with improved quality. Two recent classifiers rebuilt with the proposed negative set achieved an improvement of ~6 percent in their predictive performance, which confirmed this assumption. Based on the proposed negative set, we constructed a training set, and developed an online system called miRNApre specifically for human pre-miRNA identification. We showed that miRNApre achieved accuracies on updated human and non-human data sets that were 34.3 and 7.6 percent higher than those achieved by current methods. The results suggest that miRNApre is an effective tool for pre-miRNA identification. Additionally, by integrating miRNApre, we developed a miRNA mining tool, mirnaDetect, which can be applied to find potential miRNAs in genome-scale data. MirnaDetect achieved a comparable mining performance on human chromosome 19 data as other existing methods.


Subjects
Computational Biology/methods , MicroRNAs/genetics , Humans , Sequence Analysis, DNA , Support Vector Machine
19.
Article in English | MEDLINE | ID: mdl-26356863

ABSTRACT

Phosphorylation motifs represent position-specific amino acid patterns around the phosphorylation sites in the set of phosphopeptides. Several algorithms have been proposed to uncover phosphorylation motifs, whereas the problem of efficiently discovering a set of significant motifs with sufficiently high coverage and non-redundancy remains unsolved. Here we present a novel notion called conditional phosphorylation motifs. Through this new concept, the motifs whose over-expressiveness mainly benefits from their constituent parts can be filtered out effectively. To discover conditional phosphorylation motifs, we propose an algorithm called C-Motif for a non-redundant identification of significant phosphorylation motifs. C-Motif is implemented under the Apriori framework, and it tests the statistical significance together with the frequency of candidate motifs in a single stage. Experiments demonstrate that C-Motif outperforms some current algorithms such as MMFPh and Motif-All in terms of coverage and non-redundancy of the results and efficiency of the execution. The source code of C-Motif is available at: https://sourceforge.net/projects/cmotif/.


Subjects
Amino Acid Motifs , Computational Biology/methods , Phosphopeptides/chemistry , Sequence Analysis, Protein/methods , Algorithms , Amino Acid Sequence , Data Mining , Pattern Recognition, Automated , Phosphorylation
20.
Comput Biol Chem ; 43: 46-54, 2013 Apr.
Article in English | MEDLINE | ID: mdl-23385215

ABSTRACT

Protein inference is an important issue in proteomics research. Its main objective is to select a proper subset of candidate proteins that best explains the observed peptides. Although many methods have been proposed for solving this problem, several issues such as peptide degeneracy and one-hit wonders still remain unsolved. Therefore, the accurate identification of proteins that are truly present in the sample continues to be a challenging task. Based on the concept of peptide detectability, we formulate the protein inference problem as a constrained Lasso regression problem, which can be solved very efficiently through a coordinate descent procedure. The new inference algorithm is named ProteinLasso, and it explores an ensemble learning strategy to address the sparsity parameter selection problem in the Lasso model. We test the performance of ProteinLasso on three datasets. As shown in the experimental results, ProteinLasso outperforms state-of-the-art protein inference algorithms in terms of both identification accuracy and running efficiency. In addition, we show that ProteinLasso is stable under different parameter specifications. The source code of our algorithm is available at: http://sourceforge.net/projects/proteinlasso.
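
Coordinate descent for a constrained Lasso can be sketched in a few lines. This is not the ProteinLasso code: the constraint is assumed to be non-negativity of the coefficients, the objective is assumed to be (1/2)||y - Xw||^2 + lam * sum(w), and the peptide-by-protein matrix below uses made-up 0/1 entries in place of real detectability values. The ensemble strategy for choosing the sparsity parameter is not shown, and the function name is hypothetical.

```python
def nonneg_lasso(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for Lasso with non-negative coefficients.

    Minimizes 0.5 * ||y - X w||^2 + lam * sum(w) subject to w >= 0.
    X is a list of rows (peptides); columns correspond to proteins.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - X w for the initial w = 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # protein with no peptides: coefficient stays 0
            # Correlation of column j with the residual that excludes w[j].
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n))
            new_w = max(0.0, (rho - lam) / col_sq[j])  # shrink, then clip at 0
            diff = new_w - w[j]
            if diff != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * diff
                w[j] = new_w
    return w


# Hypothetical data: peptide 2 is shared by proteins A and B (degenerate),
# peptide 1 is unique to A and observed, peptide 3 is unique to B and not observed.
X = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
y = [1.0, 1.0, 0.0]
w = nonneg_lasso(X, y, lam=0.1)
print(w)
```

In this toy case the shared peptide is fully explained by protein A, so the penalty drives the coefficient of protein B to exactly zero, which is how sparsity resolves peptide degeneracy in a Lasso formulation.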


Subjects
Algorithms , Proteomics , Sequence Analysis, Protein , Databases, Protein , Humans , Models, Statistical , Peptides/chemistry , Regression Analysis