2.
Front Bioinform ; 4: 1306244, 2024.
Article in English | MEDLINE | ID: mdl-38501111

ABSTRACT

Introduction: DNA methylation clocks present advantageous characteristics with respect to the ambitious goal of identifying very early markers of disease, based on the concept that accelerated ageing is a reliable predictor in this sense. Methods: Such tools, being epigenomic based, are expected to be conditioned by sex and tissue specificities, and this work quantifies this dependency as well as the dependency on the regression model and the size of the training set. Results: Our quantitative results indicate that elastic-net penalization is the best performing strategy, and, unsurprisingly, more so when the data set is bigger; sex does not appear to condition clock performance, and tissue-specific clocks appear to perform better than generic blood clocks. Finally, when considering all trained clocks, we identified a subset of genes that, to the best of our knowledge, have not been presented yet and might deserve further investigation: CPT1A, MMP15, SHROOM3, SLIT3, and SYNGR. Conclusion: These factual starting points can be useful for the future medical translation of clocks and in particular in the debate between multi-tissue clocks, generally trained on a large majority of blood samples, and tissue-specific clocks.

3.
Bioinformatics ; 40(1)2024 01 02.
Article in English | MEDLINE | ID: mdl-38213002

ABSTRACT

MOTIVATION: methyLImp, a method we recently introduced for the missing value estimation of DNA methylation data, has demonstrated competitive performance in data imputation compared to existing general-purpose approaches. However, the imputation running time was considerable, making the method infeasible for large datasets with numerous missing values. RESULTS: methyLImp2 made possible computations that were previously infeasible. We achieved this by introducing two important modifications that have significantly reduced the original running time without sacrificing prediction performance. First, we implemented a chromosome-wise parallel version of methyLImp. This parallelization reduced the runtime by several 10-fold in our experiments. Then, to handle large datasets, we also introduced a mini-batch approach that uses only a subset of the samples for the imputation. Thus, it further reduces the running time from days to hours or even minutes in large datasets. AVAILABILITY AND IMPLEMENTATION: The R package methyLImp2 is under review for Bioconductor. It is currently freely available on GitHub at https://github.com/annaplaksienko/methyLImp2.


Subject(s)
Computational Biology , DNA Methylation
5.
Brief Bioinform ; 23(4)2022 07 18.
Article in English | MEDLINE | ID: mdl-35794713

ABSTRACT

In recent years there has been widespread interest in researching biomarkers of aging that could predict physiological vulnerability better than chronological age. Aging, in fact, is one of the most relevant risk factors for a wide range of maladies, and molecular surrogates of this phenotype could enable better patient stratification. Among the most promising of such biomarkers is DNA methylation-based biological age. Given the potential and variety of computational implementations (epigenetic clocks), we here present a systematic review of such clocks. Furthermore, we provide a large-scale performance comparison across different tissues and diseases in terms of age prediction accuracy and age acceleration, a measure of deviance from physiology. Our analysis offers both a state-of-the-art overview of the computational techniques developed so far and a heterogeneous picture of performance, which can be helpful in orienting future research.


Subject(s)
DNA Methylation , Epigenesis, Genetic , Biomarkers , Epigenomics/methods
6.
Nucleic Acids Res ; 49(W1): W199-W206, 2021 07 02.
Article in English | MEDLINE | ID: mdl-34038548

ABSTRACT

Methylage is an epigenetic marker of biological age that exploits the correlation between the methylation state of specific CG dinucleotides (CpGs) and chronological age (in years), gestational age (in weeks), or cellular age (in cell cycles or as telomere length, in kilobases). Using DNA methylation data, methylage is measurable via the so-called epigenetic clocks. Importantly, alterations of the correlation between methylage and age (age acceleration or deceleration) have been stably associated with pathological states and occur long before clinical signs of disease become overt, making epigenetic clocks a potentially disruptive tool in preventive, diagnostic and also forensic applications. Nevertheless, the dependency of methylage on CpG selection, mathematical modelling, tissue specificity and age range still limits the potential of this biomarker. In order to enhance model comparisons, interchange, availability, robustness and standardization, we organized a selected set of clocks within a hub webservice, EstimAge (Estimate of methylation Age, http://estimage.iac.rm.cnr.it), which intuitively and informatively enables quick identification, computation and comparison of available clocks, with the support of standard statistics.
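
As an illustration of the age acceleration mentioned in the abstract above, the sketch below uses one common convention, the signed difference between clock-predicted and chronological age. This is an assumption for illustration, not necessarily the definition used by EstimAge; the function name and the sample values are hypothetical.

```python
def age_acceleration(predicted_ages, chronological_ages):
    """Return predicted minus chronological age for each sample.

    Positive values indicate acceleration, negative values deceleration.
    Assumed convention; residual-based definitions are also in use.
    """
    if len(predicted_ages) != len(chronological_ages):
        raise ValueError("both lists must have one entry per sample")
    return [p - c for p, c in zip(predicted_ages, chronological_ages)]


# Hypothetical clock outputs and true ages, in years.
predicted = [34.5, 52.0, 61.0]
chronological = [30.0, 55.0, 61.0]
acceleration = age_acceleration(predicted, chronological)
mean_acceleration = sum(acceleration) / len(acceleration)
```

A residual from regressing predicted age on chronological age is a frequent alternative when a clock shows a systematic bias across the age range.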


Subject(s)
DNA Methylation , Software , CpG Islands , Epigenesis, Genetic , Internet , Time Factors
7.
Bioinformatics ; 37(4): 506-513, 2021 05 01.
Article in English | MEDLINE | ID: mdl-32976564

ABSTRACT

MOTIVATION: Protein fold recognition is a key step for template-based modeling approaches to protein structure prediction. Although closely related folds can be easily identified by sequence homology search in sequence databases, fold recognition is notoriously more difficult when it involves the identification of distantly related homologs. Recent progress in residue-residue contact and distance prediction opens up the possibility of improving fold recognition by using structural information contained in predicted distance and contact maps. RESULTS: Here we propose to use the congruence coefficient as a metric of similarity between maps. We prove that this metric has several interesting mathematical properties which allow one to compute in polynomial time its exact mean and variance over all possible (exponentially many) alignments between two symmetric matrices, and assess the statistical significance of similarity between aligned maps. We perform fold recognition tests by recovering predicted target contact/distance maps from the two most recent Critical Assessment of Structure Prediction editions and over 27 000 non-homologous structural templates from the ECOD database. On this large benchmark, we compare fold recognition performances of different alignment tools with their own similarity scores against those obtained using the congruence coefficient. We show that the congruence coefficient overall improves fold recognition over other methods, proving its effectiveness as a general similarity metric for protein map comparison. AVAILABILITY AND IMPLEMENTATION: The congruence coefficient software CCpro is available as part of the SCRATCH suite at: http://scratch.proteomics.ics.uci.edu/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
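
For readers unfamiliar with the metric named in the abstract above, the following minimal sketch computes a congruence coefficient between two equally sized maps, assuming the standard (Tucker) definition: the sum of element-wise products divided by the product of the two Frobenius norms. It does not perform any alignment and is not the CCpro implementation; the function name and the toy maps are hypothetical.

```python
import math


def congruence_coefficient(a, b):
    """Congruence coefficient of two same-shaped matrices (lists of rows)."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same shape")
    # Numerator: sum of element-wise products.
    num = sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    # Denominator: product of the two Frobenius norms.
    norm_a = math.sqrt(sum(x * x for ra in a for x in ra))
    norm_b = math.sqrt(sum(y * y for rb in b for y in rb))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("coefficient is undefined for an all-zero matrix")
    return num / (norm_a * norm_b)


# Toy symmetric contact maps (1 = contact), hypothetical data.
map_a = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
map_b = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
score_same = congruence_coefficient(map_a, map_a)
score_diff = congruence_coefficient(map_a, map_b)
```

In a fold recognition setting the coefficient would be evaluated on the aligned portions of a predicted map and a template map; the alignment step is outside the scope of this sketch.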


Subject(s)
Proteins , Software , Algorithms , Computational Biology , Databases, Nucleic Acid , Sequence Alignment
8.
PLoS One ; 15(3): e0229763, 2020.
Article in English | MEDLINE | ID: mdl-32155174

ABSTRACT

INTRODUCTION: Meta-analysis is a powerful means for leveraging the hundreds of experiments being run worldwide into more statistically powerful analyses. This is also true for the analysis of omic data, including genome-wide DNA methylation. In particular, thousands of DNA methylation profiles generated using the Illumina 450k are stored in the publicly accessible Gene Expression Omnibus (GEO) repository. Often, however, the intensity values produced by the BeadChip (raw data) are not deposited; therefore, only pre-processed values (obtained after computational manipulation) are available. Pre-processing may differ among studies and may then affect meta-analysis by introducing non-biological sources of variability. MATERIAL AND METHODS: To systematically investigate the effect of pre-processing on meta-analysis, we analysed four different collections of DNA methylation samples (datasets), each composed of two subsets, for which raw data from controls (i.e. healthy subjects) and cases (i.e. patients) are available. We pre-processed the data from each dataset with nine of the most common pipelines found in the literature. Moreover, we evaluated the performance of regRCPqn, a modification of the RCP algorithm that aims to improve data consistency. For each combination of pre-processing (9 × 9), we first evaluated the between-sample variability among control subjects and, then, we identified genomic positions that are differentially methylated between cases and controls (differential analysis). RESULTS AND CONCLUSION: The pre-processing of DNA methylation data affects both the between-sample variability and the loci identified as differentially methylated, and the effects of pre-processing are strongly dataset-dependent. By contrast, application of our renormalization algorithm regRCPqn: (i) reduces variability and (ii) increases agreement between meta-analysed datasets, both critical components of data harmonization.


Subject(s)
DNA Methylation , High-Throughput Nucleotide Sequencing/standards , Meta-Analysis as Topic , Sequence Analysis, DNA/standards , Animals , High-Throughput Nucleotide Sequencing/methods , Humans , Sequence Analysis, DNA/methods , Software/standards
9.
J Proteome Res ; 19(7): 2873-2878, 2020 07 02.
Article in English | MEDLINE | ID: mdl-31971806

ABSTRACT

Omics techniques provide a spectrum of information at the genomic level, whose analysis can characterize complex traits at a molecular level. The relationship between genotype and phenotype implies that the molecular pathways and biological processes underlying a given phenotype can be discovered from genome information. In dealing with this problem, gene enrichment analysis has become the most widely adopted strategy. Here we present NETGE-PLUS, a Web server for standard and network-based functional interpretation of gene sets of human and of model organisms, including Sus scrofa, Saccharomyces cerevisiae, Escherichia coli, and Arabidopsis thaliana. NETGE-PLUS enables the functional enrichment of both simple and ranked lists of genes, and also introduces the possibility of exploring relationships among KEGG pathways. A Web interface makes data retrieval complete and user-friendly. NETGE-PLUS is publicly available at http://net-ge2.biocomp.unibo.it.


Subject(s)
Arabidopsis , Software , Arabidopsis/genetics , Databases, Genetic , Genomics , Humans , Information Storage and Retrieval , Internet , Probability
10.
Bioinformatics ; 35(19): 3786-3793, 2019 10 01.
Article in English | MEDLINE | ID: mdl-30796811

ABSTRACT

MOTIVATION: DNA methylation is a stable epigenetic mark with major implications in both physiological (development, aging) and pathological conditions (cancers and numerous diseases). Recent research involving methylation focuses on the development of molecular age estimation methods based on DNA methylation levels (mAge). An increasing number of studies indicate that divergences between mAge and chronological age may be associated with age-related diseases. Current advances in high-throughput technologies have allowed the characterization of DNA methylation levels throughout the human genome. However, experimental methylation profiles often contain multiple missing values that can affect the analysis of the data and also mAge estimation. Although several imputation methods exist, a major deficiency lies in the inability to cope with large datasets, such as DNA methylation chips. Specific methods for imputing missing methylation data are therefore needed. RESULTS: We present a simple and computationally efficient imputation method, methyLImp, based on linear regression. The rationale of the approach lies in the observation that methylation levels show a high degree of inter-sample correlation. We performed a comparative study of our approach with other imputation methods on DNA methylation data of healthy and disease samples from different tissues. Performance has been assessed both in terms of imputation accuracy and in terms of the impact imputed values have on mAge estimation. In comparison to existing methods, our linear regression model proves to perform equally well or better, with good computational efficiency. The results of our analysis provide recommendations for accurate estimation of missing methylation values. AVAILABILITY AND IMPLEMENTATION: The R-package methyLImp is freely available at https://github.com/pdilena/methyLImp. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
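
To make the idea of regression-based imputation concrete, the sketch below fills missing methylation values in one CpG from a single fully observed, correlated CpG using ordinary least squares. This is a deliberately simplified one-predictor illustration and not the methyLImp algorithm; the function names, the clipping of predictions to the [0, 1] range of beta values, and the toy data are all assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor has no variance")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def impute_column(target, predictor):
    """Fill None entries of `target` from a fully observed `predictor`.

    Both lists hold one value per sample. The line is fitted on the samples
    where `target` is observed, then used to predict the missing ones.
    """
    observed = [(x, y) for x, y in zip(predictor, target) if y is not None]
    if len(observed) < 2:
        raise ValueError("need at least two observed samples to fit a line")
    slope, intercept = fit_line([x for x, _ in observed],
                                [y for _, y in observed])
    # Clip predictions to [0, 1], the valid range of methylation beta values.
    return [y if y is not None else min(1.0, max(0.0, slope * x + intercept))
            for x, y in zip(predictor, target)]


# Hypothetical beta values for two CpGs across four samples.
cpg_complete = [0.1, 0.2, 0.3, 0.4]
cpg_missing = [0.2, 0.4, None, 0.8]
imputed = impute_column(cpg_missing, cpg_complete)
```

A multi-predictor version would regress each incomplete CpG on all complete CpGs at once, which is where the computational cost discussed in the abstract arises.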


Subject(s)
DNA Methylation , Epigenomics , Humans , Linear Models , Oligonucleotide Array Sequence Analysis , Research Design
11.
Bioinformatics ; 32(22): 3489-3491, 2016 11 15.
Article in English | MEDLINE | ID: mdl-27485441

ABSTRACT

MOTIVATION: Gene enrichment is a requisite for the interpretation of biological complexity related to specific molecular pathways and biological processes. Furthermore, when interpreting NGS data and human variations, including those related to pathologies, gene enrichment allows the inclusion of other genes that in the human interactome space may also play key roles in the emergence of the phenotype. Here, we describe NET-GE, a web server for associating biological processes and pathways to sets of human proteins involved in the same phenotype. RESULTS: NET-GE is based on protein-protein interaction networks, following the notion that for a set of proteins, the context of their specific interactions can better define their function and the processes they can be related to in the biological complexity of the cell. Our method is suited to extract statistically validated enriched terms from Gene Ontology, KEGG and REACTOME annotation databases. Furthermore, NET-GE is effective even when the number of input proteins is small. AVAILABILITY AND IMPLEMENTATION: NET-GE web server is publicly available and accessible at http://net-ge.biocomp.unibo.it/enrich CONTACT: gigi@biocomp.unibo.it SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Genes , Software , Databases, Factual , Humans , Internet , Proteins/genetics , Proteins/metabolism
12.
BMC Bioinformatics ; 16: 346, 2015 Oct 28.
Article in English | MEDLINE | ID: mdl-26511083

ABSTRACT

BACKGROUND: Functional annotation of genes and gene products is a major challenge in the post-genomic era. Nowadays, gene function curation is largely based on manual assignment of Gene Ontology (GO) annotations to genes by using published literature. The annotation task is extremely time-consuming, therefore there is an increasing interest in automated tools that can assist human experts. RESULTS: Here we introduce GOTA, a GO term annotator for biomedical literature. The proposed approach makes use only of information that is readily available from public repositories and it is easily expandable to handle novel sources of information. We assess the classification capabilities of GOTA on a large benchmark set of publications. The overall performances are encouraging in comparison to the state of the art in multi-label classification over large taxonomies. Furthermore, the experimental tests provide some interesting insights into the potential improvement of automated annotation tools. CONCLUSIONS: GOTA implements a flexible and expandable model for GO annotation of biomedical literature. The current version of the GOTA tool is freely available at http://gota.apice.unibo.it.


Subject(s)
User-Computer Interface , Animals , Data Mining , Gene Ontology , Humans , Internet , Molecular Sequence Annotation
13.
Eukaryot Cell ; 14(11): 1114-26, 2015 Nov.
Article in English | MEDLINE | ID: mdl-26342020

ABSTRACT

Candida albicans is associated with humans as both a harmless commensal organism and a pathogen. Cph2 is a transcription factor whose DNA binding domain is similar to that of mammalian sterol response element binding proteins (SREBPs). SREBPs are master regulators of cellular cholesterol levels and are highly conserved from fungi to mammals. However, ergosterol biosynthesis is regulated by the zinc finger transcription factor Upc2 in C. albicans and several other yeasts. Cph2 is not necessary for ergosterol biosynthesis but is important for colonization in the murine gastrointestinal (GI) tract. Here we demonstrate that Cph2 is a membrane-associated transcription factor that is processed to release the N-terminal DNA binding domain like SREBPs, but its cleavage is not regulated by cellular levels of ergosterol or oxygen. Chromatin immunoprecipitation sequencing (ChIP-seq) shows that Cph2 binds to the promoters of HMS1 and other components of the regulatory circuit for GI tract colonization. In addition, 50% of Cph2 targets are also bound by Hms1 and other factors of the regulatory circuit. Several common targets function at the head of the glycolysis pathway. Thus, Cph2 is an integral part of the regulatory circuit for GI colonization that regulates glycolytic flux. Transcriptome sequencing (RNA-seq) shows a significant overlap in genes differentially regulated by Cph2 and hypoxia, and Cph2 is important for optimal expression of some hypoxia-responsive genes in glycolysis and the citric acid cycle. We suggest that Cph2 and Upc2 regulate hypoxia-responsive expression in different pathways, consistent with a synthetic lethal defect of the cph2 upc2 double mutant in hypoxia.


Subject(s)
Basic Helix-Loop-Helix Transcription Factors/genetics , Candida albicans/genetics , Fungal Proteins/genetics , Gene Expression Regulation, Fungal , Base Sequence , Basic Helix-Loop-Helix Transcription Factors/metabolism , Candida albicans/metabolism , Candida albicans/pathogenicity , Fungal Proteins/metabolism , Molecular Sequence Data , Protein Binding , Response Elements , Transcriptome , Virulence/genetics
14.
BMC Genomics ; 16 Suppl 8: S6, 2015.
Article in English | MEDLINE | ID: mdl-26110971

ABSTRACT

BACKGROUND: Enrichment analysis is a widely applied procedure for shedding light on the molecular mechanisms and functions at the basis of phenotypes, for enlarging the dataset of possibly related genes/proteins and for helping interpretation and prioritization of newly determined variations. Several standard and network-based enrichment methods are available. Both approaches rely on the annotations that characterize the genes/proteins included in the input set; network-based ones also include, in different ways, physical and functional relationships among different genes or proteins that can be extracted from the available biological networks of interactions. RESULTS: Here we describe a novel procedure based on the extraction from the STRING interactome of sub-networks connecting proteins that share the same Gene Ontology (GO) terms for Biological Process (BP). Enrichment analysis is performed by mapping the protein set to be analyzed on the sub-networks, and then by collecting the corresponding annotations. We test the ability of our enrichment method to find annotation terms disregarded by other available enrichment methods. We benchmarked 244 sets of proteins associated with different Mendelian diseases, according to the OMIM web resource. In 143 cases (58%), the network-based procedure extracts GO terms neglected by the standard method, and in 86 cases (35%), some of the newly enriched GO terms are not included in the set of annotations characterizing the input proteins. We present in detail six cases where our network-based enrichment provides an insight into the biological basis of the diseases, outperforming other freely available network-based methods. CONCLUSIONS: Considering a set of proteins in the context of their interaction network can help in better defining their functions. Our novel method exploits the information contained in the STRING database for building the minimal connecting network containing all the proteins annotated with the same GO term. The enrichment procedure is performed considering the GO-specific network modules and, when tested on the OMIM-derived benchmark sets, it is able to extract enrichment terms neglected by other methods. Our procedure is effective even when the size of the input protein set is small, requiring at least two input proteins.


Subject(s)
Biological Phenomena , Databases, Genetic , Gene Regulatory Networks , Computational Biology , Humans , Mendelian Randomization Analysis , Proteins/genetics , Proteins/metabolism , Software
15.
Bioinformatics ; 31(7): 1053-9, 2015 Apr 01.
Article in English | MEDLINE | ID: mdl-25429059

ABSTRACT

MOTIVATION: Mechanotransduction (the ability to output a biochemical signal from a mechanical input) is related to the initiation and progression of a broad spectrum of molecular events. Yet, the characterization of mechanotransduction lacks some of the most basic tools: for instance, it can hardly be recognized by enrichment analysis tools, nor could we find any pathway representation. This greatly limits computational testing and hypothesis generation on the biological relevance of mechanotransduction and its involvement in disease or physiological mechanisms. RESULTS: We here present a molecular map of mechanotransduction, built in CellDesigner to warrant that maximum information is embedded in a compact network format. To validate the map's necessity we tested its redundancy in comparison with existing pathways, and to estimate its sufficiency, we quantified its ability to reproduce biological events with dynamic simulations, using Signaling Petri Networks. AVAILABILITY AND IMPLEMENTATION: SBML language map is available in the Supplementary Data: core_map.xml, basic_map.xml. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Genes/genetics , Mechanotransduction, Cellular , Metabolic Networks and Pathways , Models, Biological , Signal Transduction/physiology , Software , Autoimmunity/genetics , Computer Simulation , Humans
16.
BMC Bioinformatics ; 14: 159, 2013 May 15.
Article in English | MEDLINE | ID: mdl-23672344

ABSTRACT

BACKGROUND: Molecular pathways represent an ensemble of interactions occurring among molecules within the cell and between cells. The identification of similarities between molecular pathways across organisms and functions has a critical role in understanding complex biological processes. For the inference of such novel information, the comparison of molecular pathways must account for imperfect matches (flexibility) and efficiently handle complex network topologies. To date, these characteristics are only partially available in tools designed to compare molecular interaction maps. RESULTS: Our approach MIMO (Molecular Interaction Maps Overlap) addresses the first problem by allowing the introduction of gaps and mismatches between query and template pathways and permits, when necessary, supervised queries incorporating a priori biological information. It then addresses the second issue by relying directly on the rich graph topology described in the Systems Biology Markup Language (SBML) standard, and uses multidigraphs to efficiently handle multiple queries on biological graph databases. The algorithm has been successfully used here to highlight the contact point between various human pathways in the Reactome database. CONCLUSIONS: MIMO offers a flexible and efficient graph-matching tool for comparing complex biological pathways.


Subject(s)
Metabolic Networks and Pathways , Signal Transduction , Software , Algorithms , Amino Acids/metabolism , Citric Acid Cycle , Computer Graphics , Databases, Factual , Electron Transport , Humans , Mitosis , Systems Biology , Wnt Signaling Pathway
17.
Bioinformatics ; 28(19): 2449-57, 2012 Oct 01.
Article in English | MEDLINE | ID: mdl-22847931

ABSTRACT

MOTIVATION: Residue-residue contact prediction is important for protein structure prediction and other applications. However, the accuracy of current contact predictors often barely exceeds 20% on long-range contacts, falling short of the level required for ab initio structure prediction. RESULTS: Here, we develop a novel machine learning approach for contact map prediction using three steps of increasing resolution. First, we use 2D recursive neural networks to predict coarse contacts and orientations between secondary structure elements. Second, we use an energy-based method to align secondary structure elements and predict contact probabilities between residues in contacting alpha-helices or strands. Third, we use a deep neural network architecture to organize and progressively refine the prediction of contacts, integrating information over both space and time. We train the architecture on a large set of non-redundant proteins and test it on a large set of non-homologous domains, as well as on the set of protein domains used for contact prediction in the two most recent CASP8 and CASP9 experiments. For long-range contacts, the accuracy of the new CMAPpro predictor is close to 30%, a significant increase over existing approaches. AVAILABILITY: CMAPpro is available as part of the SCRATCH suite at http://scratch.proteomics.ics.uci.edu/. CONTACT: pfbaldi@uci.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Artificial Intelligence , Computational Biology/methods , Neural Networks, Computer , Proteins/chemistry , Algorithms , Protein Structure, Secondary , Protein Structure, Tertiary
18.
PLoS One ; 6(9): e23691, 2011.
Article in English | MEDLINE | ID: mdl-21912641

ABSTRACT

B-cell leukemia/lymphoma 11B (Bcl11b) is a transcription factor showing predominant expression in the striatum. To date, there are no known gene targets of Bcl11b in the nervous system. Here, we define targets for Bcl11b in striatal cells by performing chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) in combination with genome-wide expression profiling. Transcriptome-wide analysis revealed that 694 genes were significantly altered in striatal cells over-expressing Bcl11b, including genes showing striatal-enriched expression similar to Bcl11b. ChIP-seq analysis demonstrated that Bcl11b bound a mixture of coding and non-coding sequences that were within 10 kb of the transcription start site of an annotated gene. Integrating all ChIP-seq hits with the microarray expression data, 248 direct targets of Bcl11b were identified. Functional analysis on the integrated gene target list identified several zinc-finger encoding genes as Bcl11b targets, and further revealed a significant association of Bcl11b to brain-derived neurotrophic factor/neurotrophin signaling. Analysis of ChIP-seq binding regions revealed significant consensus DNA binding motifs for Bcl11b. These data implicate Bcl11b as a novel regulator of the BDNF signaling pathway, which is disrupted in many neurological disorders. Specific targeting of the Bcl11b-DNA interaction could represent a novel therapeutic approach to lowering BDNF signaling specifically in striatal cells.


Subject(s)
Brain-Derived Neurotrophic Factor/genetics , Brain-Derived Neurotrophic Factor/metabolism , Genomics , Repressor Proteins/metabolism , Signal Transduction , Tumor Suppressor Proteins/metabolism , Animals , Base Sequence , Binding Sites , Chromatin Immunoprecipitation , Consensus Sequence , DNA/metabolism , Gene Expression Profiling , High-Throughput Nucleotide Sequencing , Mice , Neostriatum/cytology , Neostriatum/metabolism , Oligonucleotide Array Sequence Analysis , Signal Transduction/genetics
19.
BioData Min ; 4(1): 1, 2011 Jan 13.
Article in English | MEDLINE | ID: mdl-21232136

ABSTRACT

BACKGROUND: The present knowledge of protein structures at atomic level derives from some 60,000 molecules. Yet the exponentially growing set of hypothetical protein sequences comprises some 10 million chains, and this makes the problem of protein structure prediction one of the challenging goals of bioinformatics. In this context, the protein representation with contact maps is an intermediate step of fold recognition and constitutes the input of contact map predictors. However, contact map representations require fast and reliable methods to reconstruct the specific folding of the protein backbone. METHODS: In this paper, by adopting a GRID technology, our algorithm for 3D reconstruction FT-COMAR is benchmarked on a huge set of non-redundant proteins (1716) taking random noise into consideration, and this makes our computation the largest ever performed for the task at hand. RESULTS: We can observe the effects of introducing random noise on 3D reconstruction and derive some considerations useful for future implementations. The dimension of the protein set also allows statistical considerations after grouping per SCOP structural classes. CONCLUSIONS: Altogether our data indicate that the quality of 3D reconstruction is unaffected by deleting up to an average 75% of the real contacts, while only a few percent of randomly generated contacts in place of non-contacts are sufficient to hamper 3D reconstruction.

20.
Article in English | MEDLINE | ID: mdl-20855922

ABSTRACT

Correlated mutations in proteins are believed to occur in order to preserve the protein functional folding through evolution. Their values can be deduced from sequence and/or structural alignments and are indicative of residue contacts in the protein three-dimensional structure. A correlation among pairs of residues is routinely evaluated with the Pearson correlation coefficient and the MCLACHLAN similarity matrix. In the literature, there is no justification for the adoption of the MCLACHLAN matrix instead of other substitution matrices. In this paper, we approach the problem of computing the optimal similarity matrix for contact prediction with correlated mutations, i.e., the similarity matrix that maximizes the accuracy of contact prediction with correlated mutations. We describe an optimization procedure, based on the gradient descent method, for computing the optimal similarity matrix and perform an extensive number of experimental tests. Our tests show that there is a large number of optimal matrices that perform similarly to MCLACHLAN. We also find that the upper limit to the accuracy achievable in protein contact prediction is independent of the optimized similarity matrix. This suggests that the poor scoring of the correlated mutations approach may be due to the choice of the linear correlation function in evaluating correlated mutations.
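
The abstract above combines a residue similarity matrix with the Pearson correlation coefficient. The sketch below shows one usual way this is done for two alignment columns: for every pair of sequences, score the similarity of their residues in each column, then correlate the two resulting score vectors. It is an illustrative assumption rather than the paper's code, and it substitutes a toy identity similarity (1 for identical residues, 0 otherwise) for the MCLACHLAN matrix; all names and the alignment are hypothetical.

```python
import math
from itertools import combinations


def pearson(u, v):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if sd_u == 0 or sd_v == 0:
        return 0.0  # convention used here for fully conserved columns
    return cov / (sd_u * sd_v)


def identity_similarity(a, b):
    """Toy stand-in for a substitution matrix such as MCLACHLAN."""
    return 1.0 if a == b else 0.0


def correlated_mutation_score(alignment, i, j, similarity=identity_similarity):
    """Correlate residue similarities at columns i and j over sequence pairs."""
    pairs = list(combinations(range(len(alignment)), 2))
    sim_i = [similarity(alignment[k][i], alignment[l][i]) for k, l in pairs]
    sim_j = [similarity(alignment[k][j], alignment[l][j]) for k, l in pairs]
    return pearson(sim_i, sim_j)


# Hypothetical alignment: columns 0 and 1 co-vary, column 2 varies independently.
alignment = ["AKL", "AKV", "GRL", "GRV"]
score_coupled = correlated_mutation_score(alignment, 0, 1)
score_uncoupled = correlated_mutation_score(alignment, 0, 2)
```

Passing a different `similarity` function is the point at which an optimized matrix, as studied in the abstract, would be swapped in.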


Subject(s)
Computational Biology/methods , Models, Statistical , Protein Interaction Domains and Motifs , Protein Interaction Mapping/methods , Proteins/chemistry , Algorithms , Databases, Protein , Mutation