1.
RNA ; 23(3): 270-283, 2017 03.
Article in English | MEDLINE | ID: mdl-27994090

ABSTRACT

Introns are found in 5' untranslated regions (5'UTRs) for 35% of all human transcripts. These 5'UTR introns are not randomly distributed: Genes that encode secreted, membrane-bound and mitochondrial proteins are less likely to have them. Curiously, transcripts lacking 5'UTR introns tend to harbor specific RNA sequence elements in their early coding regions. To model and understand the connection between coding-region sequence and 5'UTR intron status, we developed a classifier that can predict 5'UTR intron status with >80% accuracy using only sequence features in the early coding region. Thus, the classifier identifies transcripts with 5' proximal-intron-minus-like-coding regions ("5IM" transcripts). Unexpectedly, we found that the early coding sequence features defining 5IM transcripts are widespread, appearing in 21% of all human RefSeq transcripts. The 5IM class of transcripts is enriched for non-AUG start codons, more extensive secondary structure both preceding the start codon and near the 5' cap, greater dependence on eIF4E for translation, and association with ER-proximal ribosomes. 5IM transcripts are bound by the exon junction complex (EJC) at noncanonical 5' proximal positions. Finally, N1-methyladenosines are specifically enriched in the early coding regions of 5IM transcripts. Taken together, our analyses point to the existence of a distinct 5IM class comprising ∼20% of human transcripts. This class is defined by depletion of 5' proximal introns, presence of specific RNA sequence features associated with low translation efficiency, N1-methyladenosines in the early coding region, and enrichment for noncanonical binding by the EJC.


Subject(s)
5' Untranslated Regions , Adenosine/analogs & derivatives , Base Sequence , Introns , Protein Biosynthesis , Sequence Deletion , Adenosine/genetics , Adenosine/metabolism , Codon, Initiator/chemistry , Codon, Initiator/metabolism , Eukaryotic Initiation Factor-4E/genetics , Eukaryotic Initiation Factor-4E/metabolism , Exons , Humans , Open Reading Frames , Protein Binding , Ribosomes/genetics , Ribosomes/metabolism
2.
J Proteome Res ; 14(1): 457-66, 2015 Jan 02.
Article in English | MEDLINE | ID: mdl-25299736

ABSTRACT

Threatened preterm labor (TPTL) accounts for ∼30% of pregnancy-related hospital admissions. Maternal peripheral leukocytes can be used to monitor a variety of physiological processes occurring in the body. Two high-throughput mass spectrometry methodologies, SWATH and iTRAQ, were used to study differentially expressed peripheral blood leukocyte lysate proteins in symptomatic women admitted for TPTL who had a preterm birth within 48 h (n = 16) and those who did not (n = 24). The SWATH spectral library consisted of 783 proteins. SWATH methodology quantified 258 proteins (using ≥2 peptides) and 5 proteins (ALBU, ANXA6, HNRPK, HSP90A, and PDIA1) were differentially expressed (p < 0.05, Mann-Whitney U). iTRAQ workflow identified 765 proteins; 354 proteins were quantified and 14 proteins (MIF, UBIQ, HXK3, ALBU, HNRPD, ST1A2, RS15A, RAP1B, CAN1, IQGA2, ST1A1, COX5A, ADDA, and UBQL1) were significantly different between the two groups of women (p < 0.05, Mann-Whitney U). Albumin was the only common differentially expressed protein in both SWATH (28% decrease) and iTRAQ studies (45% decrease). This decrease in albumin was validated using ELISA (11% decrease, p < 0.05, Mann-Whitney U) in another 23 TPTL women. This work suggests that albumin is a broad indicator of leukocyte activation with impending preterm birth and provides new future work directions to understand the pathophysiology of TPTL.


Subject(s)
Gene Expression Regulation/physiology , Obstetric Labor, Premature/blood , Obstetric Labor, Premature/physiopathology , Premature Birth/physiopathology , Serum Albumin/metabolism , Enzyme-Linked Immunosorbent Assay , Female , High-Throughput Screening Assays/methods , Humans , Mass Spectrometry/methods , Pregnancy , Premature Birth/blood , Statistics, Nonparametric , Time Factors , Western Australia
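
The group comparisons in the abstract above rely on the Mann-Whitney U test. As a rough illustration only, the U statistic can be computed from pooled ranks as sketched below. The function name and the sample values are hypothetical and are not taken from the study; the sketch returns the statistic only, not a p-value.

```python
from itertools import chain


def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) for two samples; tied values share an average rank."""
    pooled = sorted(chain(sample_a, sample_b))
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Values at 0-based positions i..j-1 hold 1-based ranks i+1..j.
        rank_of[pooled[i]] = (i + 1 + j) / 2
        i = j
    n_a, n_b = len(sample_a), len(sample_b)
    rank_sum_a = sum(rank_of[x] for x in sample_a)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return u_a, u_b


# Hypothetical protein abundances for two groups (illustrative values only).
delivered = [1.0, 2.0, 3.0]
not_delivered = [4.0, 5.0, 6.0]
u_delivered, u_not_delivered = mann_whitney_u(delivered, not_delivered)
```

A U of 0 for one group means every value in that group ranks below every value in the other. In practice a library routine such as `scipy.stats.mannwhitneyu` would normally be used, since it also supplies the p-value.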
3.
BMC Syst Biol ; 8 Suppl 4: S7, 2014.
Article in English | MEDLINE | ID: mdl-25521415

ABSTRACT

BACKGROUND: The human habitat is a host where microbial species evolve, function, and continue to evolve. Elucidating how microbial communities respond to human habitats is a fundamental and critical task, as establishing baselines of the human microbiome is essential to understanding its role in human disease and health. Recent studies of the healthy human microbiome focus on particular body habitats, assuming that microbiomes develop similar structural patterns to perform similar ecosystem functions under the same environmental conditions. However, current studies usually overlook the complex and interconnected landscape of the human microbiome and are limited to particular body habitats, with learning models built on a specific criterion. Therefore, these methods cannot effectively capture the real-world underlying microbial patterns. RESULTS: To obtain a comprehensive view, we propose a novel ensemble clustering framework to mine the structure of microbial community patterns in large-scale metagenomic data. In particular, we first build a microbial similarity network by integrating 1920 metagenomic samples from three body habitats of healthy adults. Then a novel symmetric Nonnegative Matrix Factorization (NMF) based ensemble model is proposed and applied to the network to detect clustering patterns. Extensive experiments are conducted to evaluate the effectiveness of our model in deriving microbial communities with respect to body habitat and host gender. From the clustering results, we observed that body habitat exhibits a strongly bound but non-unique microbial structural pattern. Meanwhile, the human microbiome reveals different degrees of structural variation over body habitat and host gender. CONCLUSIONS: In summary, our ensemble clustering framework can efficiently explore integrated clustering results to accurately identify microbial communities and provide a comprehensive view of a set of microbial communities. The clustering results indicate that the structure of the human microbiome varies systematically across body habitats and host genders. Such trends depict an integrated biography of microbial communities, which offers new insight towards uncovering pathogenic models of the human microbiome.


Subject(s)
Computational Biology/methods , Ecosystem , Metagenomics , Microbiology , Adult , Algorithms , Cluster Analysis , Female , Humans , Male , Sex Factors
4.
J Bioinform Comput Biol ; 12(4): 1450014, 2014 Aug.
Article in English | MEDLINE | ID: mdl-25152039

ABSTRACT

Protein-protein interactions (PPIs) are important for understanding the cellular mechanisms of biological functions, but the reliability of PPIs extracted by high-throughput assays is known to be low. To address this, many current methods use multiple lines of evidence from different sources of information to compute reliability scores for such PPIs. However, they often combine the evidence without taking into account the uncertainty of the evidence values, potential dependencies between the information sources used, and missing values from some information sources. We propose to formulate the task of scoring PPIs using multiple information sources as a multi-criteria decision making problem that can be solved using data fusion to model potential interactions between the multiple information sources. Using data fusion, the contribution from each information source can be apportioned accordingly to systematically score the reliability of PPIs. Our experimental results showed that the reliability scores assigned by our data fusion method can effectively classify highly reliable PPIs from multiple information sources, with substantial improvement in scoring over conventional approaches such as the Adjust CD-Distance approach. In addition, the underlying interactions between the information sources used, as well as their relative importance, can also be determined with our data fusion approach. We also showed that such knowledge can be used to effectively handle missing values from information sources.


Subject(s)
Computational Biology/methods , Protein Interaction Mapping/methods , Decision Making, Computer-Assisted , Gene Expression , High-Throughput Screening Assays , Reproducibility of Results
5.
PLoS One ; 9(5): e97079, 2014.
Article in English | MEDLINE | ID: mdl-24816822

ABSTRACT

An increasing number of genes have been experimentally confirmed in recent years as causative genes to various human diseases. The newly available knowledge can be exploited by machine learning methods to discover additional unknown genes that are likely to be associated with diseases. In particular, positive unlabeled learning (PU learning) methods, which require only a positive training set P (confirmed disease genes) and an unlabeled set U (the unknown candidate genes) instead of a negative training set N, have been shown to be effective in uncovering new disease genes in the current scenario. Using only a single source of data for prediction can be susceptible to bias due to incompleteness and noise in the genomic data and a single machine learning predictor prone to bias caused by inherent limitations of individual methods. In this paper, we propose an effective PU learning framework that integrates multiple biological data sources and an ensemble of powerful machine learning classifiers for disease gene identification. Our proposed method integrates data from multiple biological sources for training PU learning classifiers. A novel ensemble-based PU learning method EPU is then used to integrate multiple PU learning classifiers to achieve accurate and robust disease gene predictions. Our evaluation experiments across six disease groups showed that EPU achieved significantly better results compared with various state-of-the-art prediction methods as well as ensemble learning classifiers. Through integrating multiple biological data sources for training and the outputs of an ensemble of PU learning classifiers for prediction, we are able to minimize the potential bias and errors in individual data sources and machine learning algorithms to achieve more accurate and robust disease gene predictions. In the future, our EPU method provides an effective framework to integrate the additional biological and computational resources for better disease gene predictions.


Subject(s)
Algorithms , Artificial Intelligence/trends , Computational Biology/methods , Gene Regulatory Networks/genetics , Genetic Association Studies/methods , Genetic Diseases, Inborn/genetics , Models, Genetic , Gene Ontology , Humans , Phenotype , Selection Bias
6.
PLoS One ; 9(5): e96901, 2014.
Article in English | MEDLINE | ID: mdl-24828675

ABSTRACT

Threatened preterm labor (TPTL) is defined as persistent premature uterine contractions between 20 and 37 weeks of gestation and is the most common condition that requires hospitalization during pregnancy. Most of these TPTL women continue their pregnancies to term while only an estimated 5% will deliver a premature baby within ten days. The aim of this work was to study differential whole blood gene expression associated with spontaneous preterm birth (sPTB) within 48 hours of hospital admission. Peripheral blood was collected at point of hospital admission from 154 women with TPTL before any medical treatment. Microarrays were utilized to investigate differential whole blood gene expression between TPTL women who did (n = 48) or did not have a sPTB (n = 106) within 48 hours of admission. Total leukocyte and neutrophil counts were significantly higher (35% and 41% respectively) in women who had sPTB than women who did not deliver within 48 hours (p<0.001). Fetal fibronectin (fFN) test was performed on 62 women. There was no difference in the urine, vaginal and placental microbiology and histopathology reports between the two groups of women. There were 469 significant differentially expressed genes (FDR<0.05); 28 differentially expressed genes were chosen for microarray validation using qRT-PCR and 20 out of 28 genes were successfully validated (p<0.05). An optimal random forest classifier model to predict sPTB was achieved using the top nine differentially expressed genes coupled with peripheral clinical blood data (sensitivity 70.8%, specificity 75.5%). These differentially expressed genes may further elucidate the underlying mechanisms of sPTB and pave the way for future systems biology studies to predict sPTB.


Subject(s)
Abortion, Threatened/genetics , Blood Cells/metabolism , Gene Expression , Obstetric Labor, Premature/genetics , Premature Birth/genetics , Abortion, Threatened/blood , Abortion, Threatened/physiopathology , Adult , Female , Fibronectins/blood , Fibronectins/genetics , Gene Expression Profiling , Gestational Age , Humans , Infant, Newborn , Obstetric Labor, Premature/blood , Obstetric Labor, Premature/physiopathology , Oligonucleotide Array Sequence Analysis , Pregnancy , Premature Birth/blood , Term Birth/blood , Term Birth/genetics
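
The sensitivity and specificity reported in the abstract above can be related to a confusion matrix with a short worked example. The counts below are an assumption: the abstract does not state them or the evaluation set. They are the integers that reproduce the reported percentages if the classifier was scored on all 154 women (48 with sPTB, 106 without).

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Assumed counts (not given in the abstract): 34 of 48 sPTB cases and
# 80 of 106 non-sPTB cases classified correctly.
sens, spec = sensitivity_specificity(tp=34, fn=14, tn=80, fp=26)
summary = f"sensitivity {sens:.1%}, specificity {spec:.1%}"
```

Under these assumed counts, 34/48 rounds to 70.8% and 80/106 rounds to 75.5%, matching the figures in the abstract.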
7.
Development ; 141(1): 224-35, 2014 Jan.
Article in English | MEDLINE | ID: mdl-24346703

ABSTRACT

Comprehensive functional annotation of vertebrate genomes is fundamental to biological discovery. Reverse genetic screening has been highly useful for determination of gene function, but is untenable as a systematic approach in vertebrate model organisms given the number of surveyable genes and observable phenotypes. Unbiased prediction of gene-phenotype relationships offers a strategy to direct finite experimental resources towards likely phenotypes, thus maximizing de novo discovery of gene functions. Here we prioritized genes for phenotypic assay in zebrafish through machine learning, predicting the effect of loss of function of each of 15,106 zebrafish genes on 338 distinct embryonic anatomical processes. Focusing on cardiovascular phenotypes, the learning procedure predicted known knockdown and mutant phenotypes with high precision. In proof-of-concept studies we validated 16 high-confidence cardiac predictions using targeted morpholino knockdown and initial blinded phenotyping in embryonic zebrafish, confirming a significant enrichment for cardiac phenotypes as compared with morpholino controls. Subsequent detailed analyses of cardiac function confirmed these results, identifying novel physiological defects for 11 tested genes. Among these we identified tmem88a, a recently described attenuator of Wnt signaling, as a discrete regulator of the patterning of intercellular coupling in the zebrafish cardiac epithelium. Thus, we show that systematic prioritization in zebrafish can accelerate the pace of developmental gene function discovery.


Subject(s)
Gene Expression Regulation, Developmental , Heart/embryology , Membrane Proteins/metabolism , Myocardium/cytology , Zebrafish Proteins/metabolism , Zebrafish/embryology , Zebrafish/genetics , Animals , Embryo, Nonmammalian/metabolism , Gene Knockdown Techniques , Membrane Proteins/genetics , Morpholinos/genetics , Phenotype , Wnt Signaling Pathway/genetics , Zebrafish Proteins/genetics
8.
G3 (Bethesda) ; 2(2): 223-33, 2012 Feb.
Article in English | MEDLINE | ID: mdl-22384401

ABSTRACT

The body of human genomic and proteomic evidence continues to grow at ever-increasing rates, while annotation efforts struggle to keep pace. A surprisingly small fraction of human genes have clear, documented associations with specific functions, and new functions continue to be found for characterized genes. Here we assembled an integrated collection of diverse genomic and proteomic data for 21,341 human genes and made quantitative associations of each to 4333 Gene Ontology terms. We combined guilt-by-profiling and guilt-by-association approaches to exploit features unique to the data types. Performance was evaluated by cross-validation, prospective validation, and by manual evaluation with the biological literature. Functional-linkage networks were also constructed, and their utility was demonstrated by identifying candidate genes related to a glioma FLN using a seed network from genome-wide association studies. Our annotations are presented, alongside existing validated annotations, in a publicly accessible and searchable web interface.

9.
BMC Syst Biol ; 6 Suppl 2: S13, 2012.
Article in English | MEDLINE | ID: mdl-23281936

ABSTRACT

BACKGROUND: Protein complexes participate in many important cellular functions, so finding the set of existent complexes is essential for understanding the organization and regulation of processes in the cell. With the availability of large amounts of high-throughput protein-protein interaction (PPI) data, many algorithms have been proposed to discover protein complexes from PPI networks. However, such approaches are hindered by the high rate of noise in high-throughput PPI data, including spurious and missing interactions. Furthermore, many transient interactions are detected between proteins that are not from the same complex, while not all proteins from the same complex may actually interact. As a result, predicted complexes often do not match true complexes well, and many true complexes go undetected. RESULTS: We address these challenges by integrating PPI data with other heterogeneous data sources to construct a composite protein network, and using a supervised maximum-likelihood approach to weight each edge based on its posterior probability of belonging to a complex. We then use six different clustering algorithms, and an aggregative clustering strategy, to discover complexes in the weighted network. We test our method on Saccharomyces cerevisiae and Homo sapiens, and show that complex discovery is improved: compared to previously proposed supervised and unsupervised weighting approaches, our method recalls more known complexes, achieves higher precision at all recall levels, and generates novel complexes of greater functional similarity. Furthermore, our maximum-likelihood approach allows learned parameters to be used to visualize and evaluate the evidence of novel predictions, aiding human judgment of their credibility. CONCLUSIONS: Our approach integrates multiple data sources with supervised learning to create a weighted composite protein network, and uses six clustering algorithms with an aggregative clustering strategy to discover novel complexes. We show improved performance over previous approaches in terms of precision, recall, and number and quality of novel predictions. We present and visualize two novel predicted complexes in yeast and human, and find external evidence supporting these predictions.


Subject(s)
Computational Biology/methods , Protein Interaction Maps , BRCA1 Protein/metabolism , Bayes Theorem , Cluster Analysis , Humans , Likelihood Functions , Saccharomyces cerevisiae Proteins/metabolism
10.
Proteome Sci ; 9 Suppl 1: S15, 2011 Oct 14.
Article in English | MEDLINE | ID: mdl-22165860

ABSTRACT

BACKGROUND: Protein complexes are important for understanding principles of cellular organization and functions. With the availability of large amounts of high-throughput protein-protein interactions (PPI), many algorithms have been proposed to discover protein complexes from PPI networks. However, existing algorithms generally do not take into consideration the fact that not all the interactions in a PPI network take place at the same time. As a result, predicted complexes often contain many spuriously included proteins, precluding them from matching true complexes. RESULTS: We propose two methods to tackle this problem: (1) The localization GO term decomposition method: We utilize cellular component Gene Ontology (GO) terms to decompose PPI networks into several smaller networks such that the proteins in each decomposed network are annotated with the same cellular component GO term. (2) The hub removal method: This method is based on the observation that hub proteins are more likely to fuse clusters that correspond to different complexes. To avoid this, we remove hub proteins from PPI networks, and then apply a complex discovery algorithm on the remaining PPI network. The removed hub proteins are added back to the generated clusters afterwards. We tested the two methods on the yeast PPI network downloaded from BioGRID. Our results show that these methods can improve the performance of several complex discovery algorithms significantly. Further improvement in performance is achieved when we apply them in tandem. CONCLUSIONS: The performance of complex discovery algorithms is hindered by the fact that not all the interactions in a PPI network take place at the same time. We tackle this problem by using localization GO terms or hubs to decompose a PPI network before complex discovery, which achieves considerable improvement.
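
The hub removal method described in the abstract above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: connected components stand in for a real complex discovery algorithm, and the `hub_degree` and `min_links` thresholds, along with the rule for adding hubs back, are hypothetical choices that the abstract does not specify.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from (protein, protein) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def connected_components(graph, nodes):
    """Stand-in for a complex discovery algorithm (assumption)."""
    nodes = set(nodes)
    seen, clusters = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in graph[node]:
                if neighbour in nodes and neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(component)
    return clusters


def cluster_with_hub_removal(edges, hub_degree=4, min_links=2):
    """Remove hubs, cluster the remaining network, then add hubs back."""
    graph = build_graph(edges)
    hubs = {n for n, nbrs in graph.items() if len(nbrs) >= hub_degree}
    cores = connected_components(graph, set(graph) - hubs)
    clusters = [set(core) for core in cores]
    for hub in sorted(hubs):
        for core, cluster in zip(cores, clusters):
            # Hypothetical rule: a hub rejoins each cluster it links to
            # at least min_links times (counted against non-hub members).
            if len(graph[hub] & core) >= min_links:
                cluster.add(hub)
    return clusters


# Two triangles bridged by a hub protein H (made-up toy network).
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"),
             ("D", "E"), ("E", "F"), ("D", "F"),
             ("H", "A"), ("H", "B"), ("H", "D"), ("H", "E")]
toy_clusters = cluster_with_hub_removal(toy_edges)
```

Without removing H, the toy network forms a single component of seven proteins. With H removed, the two triangles separate, and H is then added back to both, which mirrors the observation that hubs tend to fuse clusters belonging to different complexes.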

11.
Mol Syst Biol ; 7: 544, 2011 Nov 08.
Article in English | MEDLINE | ID: mdl-22068327

ABSTRACT

Drug synergy allows a therapeutic effect to be achieved with lower doses of component drugs. Drug synergy can result when drugs target the products of genes that act in parallel pathways ('specific synergy'). Such cases of drug synergy should tend to correspond to synergistic genetic interaction between the corresponding target genes. Alternatively, 'promiscuous synergy' can arise when one drug non-specifically increases the effects of many other drugs, for example, by increased bioavailability. To assess the relative abundance of these drug synergy types, we examined 200 pairs of antifungal drugs in S. cerevisiae. We found 38 antifungal synergies, 37 of which were novel. While 14 cases of drug synergy corresponded to genetic interaction, 92% of the synergies we discovered involved only six frequently synergistic drugs. Although promiscuity of four drugs can be explained under the bioavailability model, the promiscuity of Tacrolimus and Pentamidine was completely unexpected. While many drug synergies correspond to genetic interactions, the majority of drug synergies appear to result from non-specific promiscuous synergy.


Subject(s)
Antifungal Agents/pharmacology , Drug Synergism , Saccharomyces cerevisiae/drug effects , Antifungal Agents/pharmacokinetics , Biological Availability , Drug Interactions , Pentamidine/pharmacokinetics , Pentamidine/pharmacology , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae/metabolism , Tacrolimus/pharmacokinetics , Tacrolimus/pharmacology
12.
J Biol Chem ; 286(27): 23653-8, 2011 Jul 08.
Article in English | MEDLINE | ID: mdl-21566122

ABSTRACT

Computational systems biology is empowering the study of drug action. Studies on biological effects of chemical compounds have increased in scale and accessibility, allowing integration with other large-scale experimental data types. Here, we review computational approaches for elucidating the mechanisms of both intended and undesirable effects of drugs, with the collective potential to change the nature of drug discovery and pharmacological therapy.


Subject(s)
Drug Discovery/methods , Systems Biology/methods , Animals , Drug Discovery/trends , Humans , Systems Biology/trends
13.
PLoS Genet ; 7(4): e1001366, 2011 Apr.
Article in English | MEDLINE | ID: mdl-21533221

ABSTRACT

In higher eukaryotes, messenger RNAs (mRNAs) are exported from the nucleus to the cytoplasm via factors deposited near the 5' end of the transcript during splicing. The signal sequence coding region (SSCR) can support an alternative mRNA export (ALREX) pathway that does not require splicing. However, most SSCR-containing genes also have introns, so the interplay between these export mechanisms remains unclear. Here we support a model in which the furthest upstream element in a given transcript, be it an intron or an ALREX-promoting SSCR, dictates the mRNA export pathway used. We also experimentally demonstrate that nuclear-encoded mitochondrial genes can use the ALREX pathway. Thus, ALREX can also be supported by nucleotide signals within mitochondrial-targeting sequence coding regions (MSCRs). Finally, we identified and experimentally verified novel motifs associated with the ALREX pathway that are shared by both SSCRs and MSCRs. Our results show strong correlation between 5' untranslated region (5'UTR) intron presence/absence and sequence features at the beginning of the coding region. They also suggest that genes encoding secretory and mitochondrial proteins share a common regulatory mechanism at the level of mRNA export.


Subject(s)
5' Untranslated Regions/genetics , Alternative Splicing , Cell Nucleus/metabolism , RNA Transport , RNA, Messenger/metabolism , Active Transport, Cell Nucleus , Adenine/metabolism , Cytoplasm , Endoplasmic Reticulum/genetics , Gene Expression Regulation , Genes, Mitochondrial , Humans , Introns , Models, Genetic , Open Reading Frames , Protein Sorting Signals , RNA Splicing
14.
BMC Bioinformatics ; 11 Suppl 7: S8, 2010 Oct 15.
Article in English | MEDLINE | ID: mdl-21106130

ABSTRACT

BACKGROUND: Protein-protein interactions (PPIs) play important roles in various cellular processes. However, the low quality of current PPI data detected from high-throughput screening techniques has diminished the potential usefulness of the data. We need to develop a method to address the high data noise and incompleteness of PPI data, namely, to filter out inaccurate protein interactions (false positives) and predict putative protein interactions (false negatives). RESULTS: In this paper, we proposed a novel two-step method to integrate diverse biological and computational sources of supporting evidence for reliable PPIs. The first step, interaction binning or InterBIN, groups PPIs together to more accurately estimate the likelihood (Bin-Confidence score) that the protein pairs interact for each biological or computational evidence source. The second step, interaction classification or InterCLASS, integrates the collected Bin-Confidence scores to build classifiers and identify reliable interactions. CONCLUSIONS: We performed comprehensive experiments on two benchmark yeast PPI datasets. The experimental results showed that our proposed method can effectively eliminate false positives in detected PPIs and identify false negatives by predicting novel yet reliable PPIs. Our proposed method also performed significantly better than merely using each of individual evidence sources, illustrating the importance of integrating various biological and computational sources of data and evidence.


Subject(s)
Computational Biology/methods , Saccharomyces cerevisiae Proteins/metabolism , Protein Interaction Mapping/methods , Reproducibility of Results , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae/metabolism , Saccharomyces cerevisiae Proteins/genetics , Software
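
The interaction binning step (InterBIN) in the abstract above groups PPIs so that a likelihood can be estimated per group. One plausible reading is sketched below, assuming equal-width bins over an evidence score in [0, 1] and a labelled reference set. The bin layout, function name, and data are all hypothetical, and the published Bin-Confidence computation may differ.

```python
def bin_confidence(scores, labels, n_bins=4):
    """Fraction of known interactions among the pairs falling in each bin.

    scores: evidence values in [0, 1] from one evidence source.
    labels: 1 for a known interaction, 0 for a known non-interaction.
    Returns one value per bin, or None for an empty bin.
    """
    positives = [0] * n_bins
    totals = [0] * n_bins
    for score, label in zip(scores, labels):
        index = min(int(score * n_bins), n_bins - 1)  # score 1.0 goes in last bin
        totals[index] += 1
        positives[index] += label
    return [p / t if t else None for p, t in zip(positives, totals)]


# Made-up evidence scores and reference labels for eight protein pairs.
evidence = [0.05, 0.10, 0.30, 0.40, 0.60, 0.70, 0.90, 0.95]
known = [0, 0, 0, 1, 1, 0, 1, 1]
confidence_per_bin = bin_confidence(evidence, known)
```

Pooling pairs within a bin gives a steadier estimate than treating each raw score separately. The per-bin values from several evidence sources could then serve as features for a classifier, in the spirit of the InterCLASS step.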
15.
Bioinformatics ; 25(15): 1891-7, 2009 Aug 01.
Article in English | MEDLINE | ID: mdl-19435747

ABSTRACT

MOTIVATION: Protein complexes are important for understanding principles of cellular organization and function. High-throughput experimental techniques have produced a large amount of protein interactions, which makes it possible to predict protein complexes from protein-protein interaction (PPI) networks. However, protein interaction data produced by high-throughput experiments are often associated with high false positive and false negative rates, which makes it difficult to predict complexes accurately. RESULTS: We use an iterative scoring method to assign weight to protein pairs, and the weight of a protein pair indicates the reliability of the interaction between the two proteins. We develop an algorithm called CMC (clustering-based on maximal cliques) to discover complexes from the weighted PPI network. CMC first generates all the maximal cliques from the PPI networks, and then removes or merges highly overlapped clusters based on their interconnectivity. We studied the performance of CMC and the impact of our iterative scoring method on CMC. Our results show that: (i) the iterative scoring method can improve the performance of CMC considerably; (ii) the iterative scoring method can effectively reduce the impact of random noise on the performance of CMC; (iii) the iterative scoring method can also improve the performance of other protein complex prediction methods and reduce the impact of random noise on their performance; and (iv) CMC is an effective approach to protein complex prediction from protein interaction network. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Computational Biology/methods , Protein Interaction Mapping/methods , Proteins/chemistry , Binding Sites , Cluster Analysis
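
The two stages of CMC named in the abstract above, generating maximal cliques and then handling highly overlapped clusters, can be illustrated with a simplified sketch. Assumptions: cliques are enumerated with the standard Bron-Kerbosch recursion, each clique is scored by its mean edge weight, and overlapping cliques are only removed (the merge option is omitted). The `max_overlap` threshold and the toy weights are hypothetical.

```python
def maximal_cliques(graph):
    """Enumerate all maximal cliques with Bron-Kerbosch (no pivoting)."""
    cliques = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(clique)
            return
        for node in sorted(candidates):
            expand(clique | {node},
                   candidates & graph[node],
                   excluded & graph[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(graph), set())
    return cliques


def clique_score(clique, weights):
    """Mean weight over all protein pairs in the clique."""
    pairs = [frozenset((a, b)) for a in clique for b in clique if a < b]
    return sum(weights[p] for p in pairs) / len(pairs) if pairs else 0.0


def prune_overlaps(cliques, weights, max_overlap=0.5):
    """Keep high-scoring cliques; drop those overlapping a kept clique."""
    ranked = sorted(cliques, key=lambda c: clique_score(c, weights), reverse=True)
    kept = []
    for clique in ranked:
        if all(len(clique & k) / len(clique) < max_overlap for k in kept):
            kept.append(clique)
    return kept


# Toy weighted PPI network (made-up reliability weights).
toy_weights = {
    frozenset(("A", "B")): 0.9, frozenset(("A", "C")): 0.9,
    frozenset(("B", "C")): 0.9, frozenset(("B", "D")): 0.4,
    frozenset(("C", "D")): 0.4, frozenset(("D", "E")): 0.8,
}
toy_graph = {}
for pair in toy_weights:
    a, b = pair
    toy_graph.setdefault(a, set()).add(b)
    toy_graph.setdefault(b, set()).add(a)

all_cliques = maximal_cliques(toy_graph)
predicted = prune_overlaps(all_cliques, toy_weights)
```

In the toy network, the low-weight clique {B, C, D} shares two of its three proteins with the higher-scoring {A, B, C} and is dropped. This shows how edge weights from a scoring method can influence which clusters survive.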
16.
Ann N Y Acad Sci ; 1158: 224-33, 2009 Mar.
Article in English | MEDLINE | ID: mdl-19348644

ABSTRACT

The protein-protein subnetwork prediction challenge presented at the 2nd Dialogue for Reverse Engineering Assessments and Methods (DREAM2) conference is an important computational problem essential to proteomic research. Given a set of proteins from the Saccharomyces cerevisiae (baker's yeast) genome, the task is to rank all possible interactions between the proteins from the most likely to the least likely. To tackle this task, we adopt a graph-based strategy to combine multiple sources of biological data and computational predictions. Using training and testing sets extracted from existing yeast protein-protein interactions, we evaluate our method and show that it can produce better predictions than any of the individual data sources. This technique is then used to produce our entry for the protein-protein subnetwork prediction challenge.


Subject(s)
Computational Biology/methods , Protein Interaction Mapping , Saccharomyces cerevisiae Proteins , Area Under Curve , Databases, Protein , Genome, Fungal , Models, Genetic , ROC Curve , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae/metabolism , Saccharomyces cerevisiae Proteins/genetics , Saccharomyces cerevisiae Proteins/metabolism
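
The task in the abstract above is to rank candidate interactions from most to least likely, and the subject headings mention ROC curves and area under the curve. As an illustration of how such a ranking can be scored, the sketch below computes the area under the ROC curve directly from pairwise comparisons. The scores and labels are made up, and nothing here reproduces the authors' graph-based method.

```python
def roc_auc(scores, labels):
    """Probability that a true pair outranks a false pair (ties count half)."""
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted scores for five candidate protein pairs and
# whether each pair is a true interaction (1) or not (0).
predicted_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
true_labels = [1, 0, 1, 1, 0]
auc = roc_auc(predicted_scores, true_labels)
```

An AUC of 0.5 corresponds to a random ranking and 1.0 to a perfect one. The pairwise form is quadratic in the number of pairs, so a rank-based formula or a library routine would be preferable for large candidate sets.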
17.
Drug Discov Today ; 13(15-16): 652-8, 2008 Aug.
Article in English | MEDLINE | ID: mdl-18595769

ABSTRACT

Protein interactions are crucial components of all cellular processes. An in-depth knowledge of the full complement of protein interactions in a cell, therefore, provides insight into the structure, properties and functions of the cell and its components. An accurate and comprehensive protein interaction network is, thus, an invaluable framework to study protein regulation in disease. Although the amount of protein-protein interaction data has grown significantly because of advances in high-throughput experimental techniques, these high-throughput methods are highly susceptible to noise. Therefore, computational techniques for assessing the reliability of a protein-protein interaction are highly desirable. We review here computational techniques for assessing and improving the reliability of protein-protein interaction data from these high-throughput experiments.


Subject(s)
Protein Interaction Mapping/methods , Proteins/metabolism , Animals , Computational Biology/methods , Humans , Protein Binding , Proteins/chemistry , Proteins/genetics , Reproducibility of Results , Sequence Analysis, Protein
18.
J Bioinform Comput Biol ; 6(3): 435-66, 2008 Jun.
Article in English | MEDLINE | ID: mdl-18574858

ABSTRACT

Protein complexes are fundamental for understanding principles of cellular organization. As the sizes of protein-protein interaction (PPI) networks are increasing, accurate and fast protein complex prediction from these PPI networks can serve as a guide for biological experiments to discover novel protein complexes. However, it is not easy to predict protein complexes from PPI networks, especially in situations where the PPI network is noisy and still incomplete. Here, we study the use of indirect interactions between level-2 neighbors (level-2 interactions) for protein complex prediction. We know from previous work that proteins which do not interact but share interaction partners (level-2 neighbors) often share biological functions. We have proposed a method in which all direct and indirect interactions are first weighted using topological weight (FS-Weight), which estimates the strength of functional association. Interactions with low weight are removed from the network, while level-2 interactions with high weight are introduced into the interaction network. Existing clustering algorithms can then be applied to this modified network. We have also proposed a novel algorithm that searches for cliques in the modified network and merges cliques to form clusters using a "partial clique merging" method. Experiments show that (1) the use of indirect interactions and topological weight to augment protein-protein interactions can be used to improve the precision of clusters predicted by various existing clustering algorithms; and (2) our complex-finding algorithm performs very well on interaction networks modified in this way. Since no other information except the original PPI network is used, our approach would be very useful for protein complex prediction, especially for prediction of novel protein complexes.


Subject(s)
Computational Biology , Computer Simulation , Protein Interaction Mapping , Algorithms
19.
Bioinformatics ; 23(24): 3364-73, 2007 Dec 15.
Article in English | MEDLINE | ID: mdl-18048396

ABSTRACT

MOTIVATION: With the increasing availability of diverse biological information, protein function prediction approaches have converged towards integration of heterogeneous data. Many adapt existing techniques, such as machine-learning and probabilistic methods, which have proven successful on specific data types. However, the impact of these approaches is hindered by a couple of factors. First, there is little comparison between existing approaches. This is in part due to a divergence in the focus adopted by different works, which makes comparison difficult or even fuzzy. Second, there seems to be over-emphasis on the use of computationally demanding machine-learning methods, which runs counter to the surge in biological data. Analogous to the success of BLAST for sequence homology search, we believe that the ability to tap the escalating quantity, quality and diversity of biological data is crucial to the success of automated function prediction as a useful instrument for the advancement of proteomic research. We address these problems by: (1) providing a useful comparison between some prominent methods; (2) proposing Integrated Weighted Averaging (IWA)--a scalable, efficient and flexible function prediction framework that integrates diverse information using simple weighting strategies and a local prediction method. The simplicity of the approach makes it possible to make predictions based on on-the-fly information fusion. RESULTS: In addition to its greater efficiency, IWA performs exceptionally well against existing approaches. In the presence of cross-genome information, which is overwhelming for existing approaches, IWA makes even better predictions. We also demonstrate the significance of appropriate weighting strategies in data integration.
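The general idea of combining several data sources through simple weights and a local (neighbour-based) prediction can be sketched as below. This is not the IWA algorithm itself, whose weighting strategies are not specified in the abstract; it is a generic weighted neighbour vote, and the source weights, data layout, and function name are assumptions made for illustration.

```python
def predict_functions(protein, sources, annotations):
    """Score candidate functions for `protein` by a weighted neighbour vote.

    sources:     list of (weight, {protein: set of associated proteins}),
                 one entry per data source (hypothetical layout).
    annotations: {protein: set of known function labels}.
    Returns {function: score in [0, 1]}, the weight-averaged fraction of
    neighbours carrying each function.
    """
    votes = {}
    total_weight = 0.0
    for weight, adjacency in sources:
        for neighbour in adjacency.get(protein, ()):
            total_weight += weight
            for function in annotations.get(neighbour, ()):
                votes[function] = votes.get(function, 0.0) + weight
    if total_weight == 0:
        return {}  # no evidence from any source
    return {function: v / total_weight for function, v in votes.items()}


if __name__ == "__main__":
    # Two hypothetical sources: one trusted twice as much as the other.
    sources = [
        (2.0, {"P": {"Q", "R"}}),
        (1.0, {"P": {"S"}}),
    ]
    annotations = {
        "Q": {"kinase"},
        "R": {"kinase", "transport"},
        "S": {"transport"},
    }
    print(predict_functions("P", sources, annotations))
```

Because each source contributes only a dictionary lookup and a running sum, a new source can be added at query time without retraining, which is the kind of on-the-fly fusion the abstract alludes to.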


Subject(s)
Computational Biology/methods , Database Management Systems , Databases, Protein , Information Storage and Retrieval/methods , Models, Biological , Proteins/chemistry , Proteins/physiology , Computer Simulation , Structure-Activity Relationship , Systems Integration
20.
Article in English | MEDLINE | ID: mdl-17951816

ABSTRACT

Protein complexes are fundamental for understanding principles of cellular organization. Accurate and fast protein complex prediction from PPI networks of increasing size can serve as a guide for biological experiments to discover novel protein complexes. However, protein complex prediction from PPI networks is a hard problem, especially in situations where the PPI network is noisy. We know from previous work that proteins that do not interact but share interaction partners (level-2 neighbors) often share biological functions. The strength of functional association can be estimated using a topological weight, FS-Weight. Here we study the use of indirect interactions between level-2 neighbors (level-2 interactions) for protein complex prediction. All direct and indirect interactions are first weighted using topological weight (FS-Weight). Interactions with low weight are removed from the network, while level-2 interactions with high weight are introduced into the interaction network. Existing clustering algorithms can then be applied to this modified network. We also propose a novel algorithm that searches for cliques in the modified network and merges them to form clusters using a "partial clique merging" method. In this paper, we show that 1) the use of indirect interactions and topological weight to augment protein-protein interactions can be used to improve the precision of clusters predicted by various existing clustering algorithms; 2) our complex-finding algorithm performs very well on interaction networks modified in this way. Since no other information except the original PPI network is used, our approach would be very useful for protein complex prediction, especially for prediction of novel protein complexes.
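The clique search and merging step mentioned in this abstract can be illustrated with a short Python sketch. The abstract does not define the "partial clique merging" criterion, so the sketch assumes a simple rule: two cliques are merged when their shared members make up at least a chosen fraction of the smaller one. Maximal cliques are enumerated with the standard Bron-Kerbosch recursion; the function names, the minimum clique size, and the overlap threshold are hypothetical.

```python
def build_adjacency(edges):
    """Return {protein: set of direct interaction partners}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with basic Bron-Kerbosch (no pivoting)."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return cliques


def merge_overlapping(cliques, min_overlap, min_size=3):
    """Greedily merge cliques whose overlap covers at least `min_overlap`
    of the smaller clique. This rule is an assumed stand-in, not the
    published merging criterion."""
    clusters = [set(c) for c in cliques if len(c) >= min_size]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                if len(a & b) / min(len(a), len(b)) >= min_overlap:
                    clusters[i] = a | b
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters


if __name__ == "__main__":
    toy_edges = [
        ("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D"),
        ("B", "E"), ("C", "E"), ("D", "E"),
        ("X", "Y"), ("Y", "Z"), ("X", "Z"),
    ]
    adjacency = build_adjacency(toy_edges)
    for cluster in merge_overlapping(maximal_cliques(adjacency), 0.7):
        print(sorted(cluster))
```

In the toy network the two four-member cliques share three proteins, so they merge into one five-member cluster even though A and E never interact, while the isolated triangle stays separate.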


Subject(s)
Models, Biological , Models, Chemical , Protein Interaction Mapping/methods , Proteins/chemistry , Proteins/metabolism , Signal Transduction/physiology , Animals , Binding Sites , Computer Simulation , Humans , Protein Binding