1.
Bioinform Adv ; 3(1): vbad012, 2023.
Article in English | MEDLINE | ID: mdl-36789292

ABSTRACT

Motivation: Protein-protein interactions drive many relevant biological events, such as infection, replication and recognition. To control or engineer such events, we need access to the molecular details of the interaction provided by experimental 3D structures. However, such experiments take time and are expensive; moreover, current technology cannot keep up with the high discovery rate of new interactions. Computational modeling, such as protein-protein docking, can help to fill this gap by generating docking poses. Protein-protein docking generally consists of two parts: sampling and scoring. The sampling is an exhaustive search of the three-dimensional space. The caveat of the sampling is that it generates a large number of incorrect poses, producing a highly unbalanced dataset, which limits the utility of the data for training machine learning classifiers. Results: Using weak supervision, we developed a data augmentation method that we named hAIkal. Using hAIkal, we increased the labeled training data to train several algorithms. We trained and obtained different classifiers; the best classifier has 81% accuracy and a 0.51 Matthews correlation coefficient on the test set, surpassing state-of-the-art scoring functions. Availability and implementation: Docking models from Benchmark 5 are available at https://doi.org/10.5281/zenodo.4012018. Processed tabular data are available at https://repository.kaust.edu.sa/handle/10754/666961. A Google Colab notebook is available at https://colab.research.google.com/drive/1vbVrJcQSf6_C3jOAmZzgQbTpuJ5zC1RP?usp=sharing. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
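
The core idea, generating weak labels for unlabeled docking poses so a classifier can be trained despite the class imbalance, can be illustrated with a minimal sketch. The labeling functions, feature names and thresholds below are hypothetical placeholders, not those used by hAIkal:

```python
# Hypothetical labeling functions: each votes +1 (near-native), -1 (incorrect)
# or 0 (abstain) based on one cheap property of a docking pose. The feature
# names and thresholds are illustrative, not those used by hAIkal.
def lf_energy(pose):       # low interface energy suggests a plausible pose
    return 1 if pose["energy"] < -40.0 else -1

def lf_cluster(pose):      # poses in large clusters are more often near-native
    return 1 if pose["cluster_size"] > 20 else 0

def lf_clashes(pose):      # many steric clashes suggest an incorrect pose
    return -1 if pose["clashes"] > 10 else 0

LABELING_FUNCTIONS = [lf_energy, lf_cluster, lf_clashes]

def weak_label(pose):
    """Majority vote over the labeling functions; None when the vote ties."""
    total = sum(lf(pose) for lf in LABELING_FUNCTIONS)
    if total == 0:
        return None
    return 1 if total > 0 else 0

# Weakly labeled poses augment the scarce gold-labeled training data.
unlabeled_poses = [
    {"energy": -55.2, "cluster_size": 31, "clashes": 2},
    {"energy": -12.9, "cluster_size": 3, "clashes": 18},
]
augmented = [(p, y) for p in unlabeled_poses if (y := weak_label(p)) is not None]
print(augmented)  # poses paired with weak labels 1 and 0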

2.
Nucleic Acids Res ; 45(4): e25, 2017 02 28.
Article in English | MEDLINE | ID: mdl-27789687

ABSTRACT

Promoters and enhancers regulate the initiation of gene expression and the maintenance of expression levels in a spatial and temporal manner. Recent findings stemming from Cap Analysis of Gene Expression (CAGE) demonstrate that promoters and enhancers, based on their expression profiles after a stimulus, belong to different transcription response subclasses. One of the most promising biological features that might explain the difference in transcriptional response between subclasses is the local chromatin environment. We introduce a novel computational framework, PEDAL, for effectively distinguishing the transcriptional profiles of promoters and enhancers using only histone modification marks, chromatin accessibility and binding sites of transcription factors and co-activators. A case study on data from the MCF-7 cell line reveals that PEDAL can successfully identify the transcription response subclasses of promoters and enhancers from two different stimulations. Moreover, we report subsets of input markers that discriminate MCF-7 promoter and enhancer transcription response subclasses with minimal classification error. Our work provides a general computational approach for effectively identifying cell-specific and stimulation-specific promoter and enhancer transcriptional profiles, and thus contributes to improving our understanding of transcriptional activation in humans.
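
A minimal sketch of the kind of analysis PEDAL performs: training a classifier on chromatin features and searching for the marker subset with the lowest cross-validated classification error. The feature names and random data are illustrative, and scikit-learn's RFECV stands in for PEDAL's actual selection procedure:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
# Rows are regulatory regions; columns are hypothetical chromatin features
# (histone marks, accessibility, TF/co-activator binding). Data are random.
feature_names = ["H3K4me1", "H3K4me3", "H3K27ac", "DNase", "TF_binding"]
X = rng.normal(size=(200, len(feature_names)))
y = rng.integers(0, 3, size=200)   # three response subclasses (illustrative)

# Recursive feature elimination with cross-validation searches for the
# marker subset with the lowest classification error.
selector = RFECV(RandomForestClassifier(n_estimators=200, random_state=0),
                 cv=StratifiedKFold(5))
selector.fit(X, y)
kept = [f for f, keep in zip(feature_names, selector.support_) if keep]
print("discriminative markers:", kept)
```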


Subject(s)
Computational Biology/methods; Enhancer Elements, Genetic; Promoter Regions, Genetic; Transcription, Genetic; Algorithms; Chromatin/genetics; Epidermal Growth Factor/pharmacology; Gene Expression Profiling; Gene Expression Regulation/drug effects; Humans; MCF-7 Cells; Protein Binding; Transcription Factors; Transcriptional Activation; Workflow
3.
J Cheminform ; 8: 64, 2016.
Article in English | MEDLINE | ID: mdl-27895719

ABSTRACT

BACKGROUND: Mining high-throughput screening (HTS) assays is key for enhancing decisions in the areas of drug repositioning and drug discovery. However, many challenges are encountered in the process of developing suitable and accurate methods for extracting useful information from these assays. Virtual screening and the wide variety of databases, methods and solutions proposed to date have not completely overcome these challenges. This study is based on a multi-label classification (MLC) technique for modeling correlations between several HTS assays, meaning that a single prediction represents a subset of assigned correlated labels instead of one label. Thus, the devised method provides an increased probability of more accurate predictions for compounds that were not tested in particular assays. RESULTS: Here we present DRABAL, a novel MLC solution that incorporates structure learning of a Bayesian network as a step to model dependency between the HTS assays. In this study, DRABAL was used to process more than 1.4 million interactions of over 400,000 compounds and analyze the existing relationships between five large HTS assays from the PubChem BioAssay Database. Compared to different MLC methods, DRABAL significantly improves the F1 score by about 22% on average. We further illustrated the usefulness of DRABAL by screening FDA-approved drugs and reporting those that have a high probability of interacting with several targets, thus enabling drug-multi-target repositioning. Specifically, DRABAL suggests the drug thiabendazole as a common activator of the NPC1 and Rab-9A proteins, both of which are targets of assays designed to identify treatment modalities for Niemann-Pick type C disease. CONCLUSION: We developed a novel MLC solution based on a Bayesian active learning framework to overcome the challenge of lacking fully labeled training data and to exploit actual dependencies between the HTS assays. The solution is motivated by the need to model dependencies between existing experimental confirmatory HTS assays and to improve prediction performance. We have pursued extensive experiments over several HTS assays and have shown the advantages of DRABAL. The datasets and programs can be downloaded from https://figshare.com/articles/DRABAL/3309562.
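
DRABAL's key ingredient is exploiting dependencies between assay labels. As a simpler stand-in for its Bayesian network structure learning, the sketch below uses scikit-learn's ClassifierChain, which also propagates label dependencies; the fingerprint matrix and labels are synthetic:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 64))           # stand-in compound fingerprints
Y = rng.integers(0, 2, size=(500, 5))    # activity labels in five HTS assays

# A classifier chain feeds earlier label predictions into later classifiers,
# so correlated assays inform one another at prediction time.
chain = ClassifierChain(LogisticRegression(max_iter=1000),
                        order="random", random_state=1)
chain.fit(X, Y)
print(chain.predict(X[:3]))  # each row is a subset of correlated assay labels
```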

4.
J Cheminform ; 8: 15, 2016.
Article in English | MEDLINE | ID: mdl-26985240

ABSTRACT

BACKGROUND: Identification of novel drug-target interactions (DTIs) is important for drug discovery. Experimental determination of such DTIs is costly and time consuming, which necessitates the development of efficient computational methods for the accurate prediction of potential DTIs. To date, many computational methods have been proposed for this purpose, but they suffer from a high rate of false positive predictions. RESULTS: Here, we developed a novel computational DTI prediction method, DASPfind. DASPfind uses simple paths of particular lengths inferred from a graph that describes DTIs, similarities between drugs, and similarities between the protein targets of drugs. We show that on average, over the four gold standard DTI datasets, DASPfind significantly outperforms other existing methods when the single top-ranked predictions are considered, resulting in 46.17% of these predictions being correct, and it achieves 49.22% correct single top-ranked predictions when the set of all DTIs for a single drug is tested. Furthermore, we demonstrate that our method is best suited for predicting DTIs in cases of drugs with no known targets or with few known targets. We also show the practical use of DASPfind by generating novel predictions for the Ion Channel dataset and validating them manually. CONCLUSIONS: DASPfind is a computational method for finding reliable new interactions between drugs and proteins. We show over six different DTI datasets that DASPfind outperforms other state-of-the-art methods when the single top-ranked predictions are considered, or when a drug with no known targets or with few known targets is considered. We illustrate the usefulness and practicality of DASPfind by predicting novel DTIs for the Ion Channel dataset. The validated predictions suggest that DASPfind can be used as an efficient method to identify correct DTIs, thus reducing the cost of the necessary experimental verifications in the process of drug discovery. DASPfind can be accessed online at http://www.cbrc.kaust.edu.sa/daspfind. Graphical abstract: The conceptual workflow for predicting drug-target interactions using DASPfind.
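
The path-based scoring idea can be sketched as follows: build a graph of drug similarities, target similarities and known interactions, then score a candidate drug-target pair over its simple paths. The graph, weights and length penalty below are illustrative and do not reproduce DASPfind's exact scoring function:

```python
import networkx as nx

# Toy heterogeneous graph: drug-drug similarity, target-target similarity
# and one known drug-target interaction (all weights are illustrative).
G = nx.Graph()
G.add_edge("drugA", "drugB", weight=0.8)      # drug similarity
G.add_edge("drugB", "targetX", weight=1.0)    # known interaction
G.add_edge("targetX", "targetY", weight=0.7)  # target similarity

def dti_score(graph, drug, target, cutoff=3, penalty=0.5):
    """Hypothetical score: sum over simple paths (up to `cutoff` edges) of
    the product of edge weights, damped per edge; not DASPfind's exact
    scoring function."""
    total = 0.0
    for path in nx.all_simple_paths(graph, drug, target, cutoff=cutoff):
        contribution = 1.0
        for u, v in zip(path, path[1:]):
            contribution *= graph[u][v]["weight"] * penalty
        total += contribution
    return total

# Rank a candidate interaction that is absent from the graph.
print(dti_score(G, "drugA", "targetY"))
```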

5.
Brief Bioinform ; 17(6): 967-979, 2016 11.
Article in English | MEDLINE | ID: mdl-26634919

ABSTRACT

Enhancers are cis-acting DNA elements that play critical roles in the distal regulation of gene expression. Identifying enhancers is an important step for understanding distinct gene expression programs that may reflect normal and pathogenic cellular conditions. Experimental identification of enhancers is constrained by the set of conditions used in the experiment; multiple experiments are required because enhancers can be active under specific cellular conditions but not in different cell types/tissues or cellular states. This has opened prospects for computational prediction methods that can be used for high-throughput identification of putative enhancers to complement experimental approaches. Potential functions and properties of predicted enhancers have been catalogued and summarized in several enhancer-oriented databases. Because the current methods for the computational prediction of enhancers produce significantly different enhancer predictions, it will be beneficial for the research community to have an overview of the strategies and solutions developed in this field. In this review, we focus on the identification and analysis of enhancers by bioinformatics approaches. First, we describe a general framework for computational identification of enhancers, present relevant data types and discuss possible computational solutions. Next, we cover over 30 existing computational enhancer identification methods developed since 2000. Our review highlights their advantages, limitations and potential, while suggesting pragmatic guidelines for the development of more efficient computational enhancer prediction methods. Finally, we discuss challenges and open problems of this topic that require further consideration.


Subject(s)
Computational Biology; Enhancer Elements, Genetic; Histones
6.
PLoS One ; 10(12): e0144426, 2015.
Article in English | MEDLINE | ID: mdl-26658480

ABSTRACT

High-throughput screening (HTS) experiments provide a valuable resource that reports the biological activity of numerous chemical compounds relative to their molecular targets. Building computational models that accurately predict such activity status (active vs. inactive) in specific assays is a challenging task given the large volume of data and the frequently small proportion of active compounds relative to inactive ones. We developed a method, DRAMOTE, to predict the activity status of chemical compounds in HTS activity assays. For a class of HTS assays, our method achieves considerably better results than the current state-of-the-art solutions. We achieved this by modifying a minority oversampling technique. To demonstrate that DRAMOTE performs better than the other methods, we carried out a comprehensive comparison with several other methods, evaluating them on data from 11 PubChem assays through 1,350 experiments that involved approximately 500,000 interactions between chemicals and their target proteins. As an example of potential use, we applied DRAMOTE to develop robust models for predicting FDA-approved drugs that have a high probability of interacting with the thyroid stimulating hormone receptor (TSHR) in humans. Our findings are further partially and indirectly supported by 3D docking results and literature information. The results based on approximately 500,000 interactions suggest that DRAMOTE performed best and that it can be used for developing robust virtual screening models. The datasets and implementation of all solutions are available as a MATLAB toolbox online at www.cbrc.kaust.edu.sa/dramote and can also be found on Figshare.
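
Minority oversampling of the kind DRAMOTE modifies can be sketched as follows: synthetic actives are interpolated between a minority sample and one of its nearest minority neighbors (the SMOTE idea). DRAMOTE's specific modification is not reproduced here, and the data are synthetic:

```python
import numpy as np

def oversample_minority(X_min, n_new, k=5, seed=0):
    """SMOTE-style sketch: create synthetic samples on the segments between
    a minority sample and one of its k nearest minority neighbors. DRAMOTE
    modifies this idea; the exact modification is not reproduced here."""
    rng = np.random.default_rng(seed)
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)            # a point is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))          # random minority sample
        j = rng.choice(neighbors[i])          # one of its nearest neighbors
        lam = rng.random()                    # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

X_active = np.random.default_rng(1).normal(size=(20, 8))   # few actives
X_balanced = np.vstack([X_active, oversample_minority(X_active, n_new=80)])
print(X_balanced.shape)  # (100, 8): the minority class scaled up five-fold
```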


Subject(s)
High-Throughput Screening Assays/methods; Databases, Chemical; Humans; Receptors, Thyrotropin/drug effects
7.
Bioinformatics ; 31(21): 3421-8, 2015 Nov 01.
Article in English | MEDLINE | ID: mdl-26177965

ABSTRACT

MOTIVATION: Next-generation sequencing generates large amounts of data affected by errors in the form of substitutions, insertions or deletions of bases. Error correction based on high-coverage information typically improves de novo assembly. Most existing tools can correct substitution errors only; some support insertions and deletions, but their accuracy in many cases is low. RESULTS: We present Karect, a novel error correction technique based on multiple alignment. Our approach supports substitution, insertion and deletion errors. It can handle non-uniform coverage as well as moderately covered areas of the sequenced genome. Experiments with data from Illumina, 454 FLX and Ion Torrent sequencing machines demonstrate that Karect is more accurate than previous methods, both in terms of correcting individual-base errors (up to 10% increase in accuracy gain) and post-de novo assembly quality (up to 10% increase in NGA50). We also introduce an improved framework for evaluating the quality of error correction. AVAILABILITY AND IMPLEMENTATION: Karect is available at http://aminallam.github.io/karect. CONTACT: amin.allam@kaust.edu.sa SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
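
The consensus step of alignment-based correction can be illustrated with a toy sketch: given reads already placed in a common alignment frame, each base is corrected toward a sufficiently supported column majority. Karect builds the multiple alignments itself and treats indels far more carefully than this:

```python
from collections import Counter

# Reads already placed in a common alignment frame ('-' marks a gap).
aligned_reads = [
    "ACGT-ACGA",
    "ACGTTACGA",
    "ACGA-ACGA",  # likely substitution error at position 3
    "ACGT-ACGA",
]

def correct(read, reads, min_support=0.7):
    """Replace a base when a sufficiently supported column majority disagrees."""
    out = []
    for i, base in enumerate(read):
        column = [r[i] for r in reads]
        top, count = Counter(column).most_common(1)[0]
        supported = count / len(column) >= min_support
        out.append(top if supported and top != "-" else base)
    return "".join(out)

print(correct(aligned_reads[2], aligned_reads))  # -> "ACGT-ACGA"
```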


Subject(s)
Algorithms; High-Throughput Nucleotide Sequencing/methods; INDEL Mutation/genetics; Mutagenesis, Insertional/genetics; Sequence Analysis, DNA/methods; Sequence Deletion; Chromosome Mapping; Computational Biology/methods; Genome, Human; Humans
8.
Bioinformatics ; 31(14): 2332-9, 2015 Jul 15.
Article in English | MEDLINE | ID: mdl-25758402

ABSTRACT

MOTIVATION: Pathogens infect their host and hijack the host machinery to produce more progeny pathogens. Obligate intracellular pathogens, in particular, require resources of the host to replicate. Therefore, infections by these pathogens lead to alterations in the metabolism of the host, shifting it in favor of pathogen protein production. Some computational methods for identifying mechanisms of host-pathogen interactions have been proposed, but the problem has yet to be approached from the metabolite-hijacking angle. RESULTS: We propose a novel computational framework, Hi-Jack, for inferring pathway-based interactions between a host and a pathogen that relies on the idea of metabolite hijacking. Hi-Jack searches metabolic network data from hosts and pathogens and identifies candidate reactions where hijacking occurs. A novel scoring function ranks candidate hijacked reactions and identifies pathways in the host that interact with pathways in the pathogen, as well as the associated frequently hijacked metabolites. We also describe host-pathogen interaction principles that can be used in subsequent studies. Our case study on Mycobacterium tuberculosis (Mtb) revealed pathways in human (e.g. carbohydrate metabolism, lipid metabolism and pathways related to amino acid metabolism) that are likely to be hijacked by the pathogen. In addition, we report interesting potential pathway interconnections between human and Mtb, such as the linkage of human fatty acid biosynthesis with Mtb biosynthesis of unsaturated fatty acids, or the linkage of the human pentose phosphate pathway with lipopolysaccharide biosynthesis in Mtb. AVAILABILITY AND IMPLEMENTATION: Datasets and code are available at http://cloud.kaust.edu.sa/Pages/Hi-Jack.aspx
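
The candidate-identification step can be illustrated with a toy sketch: pathogen reactions are paired with host reactions that share metabolites, and pairs are ranked by the number of shared metabolites. Reaction names and metabolite sets below are invented, and Hi-Jack's actual scoring function is more elaborate:

```python
# Toy host and pathogen reactions represented as metabolite sets.
host_reactions = {
    "glycolysis_r1": {"glucose", "ATP", "glucose-6-phosphate", "ADP"},
    "ppp_r1": {"glucose-6-phosphate", "NADP+", "NADPH", "6-phosphogluconolactone"},
}
pathogen_reactions = {
    "mtb_lps_r1": {"glucose-6-phosphate", "UTP", "UDP-glucose"},
    "mtb_fas_r1": {"acetyl-CoA", "malonyl-CoA", "NADPH"},
}

def hijack_candidates(host, pathogen):
    """Rank pathogen/host reaction pairs by shared metabolites (toy score)."""
    scored = []
    for pr, p_mets in pathogen.items():
        for hr, h_mets in host.items():
            shared = p_mets & h_mets
            if shared:
                scored.append((len(shared), pr, hr, sorted(shared)))
    return sorted(scored, reverse=True)

for score, pr, hr, shared in hijack_candidates(host_reactions, pathogen_reactions):
    print(f"{pr} may hijack {hr}: {score} shared metabolite(s) {shared}")
```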


Subject(s)
Computational Biology/methods; Host-Pathogen Interactions; Metabolic Networks and Pathways; Metabolomics/methods; Mycobacterium tuberculosis/metabolism; Proteins/metabolism; Tuberculosis/metabolism; Algorithms; Humans; Tuberculosis/microbiology
9.
PLoS One ; 10(2): e0117988, 2015.
Article in English | MEDLINE | ID: mdl-25719748

ABSTRACT

Many scientific problems can be formulated as classification tasks. Data that harbor relevant information are usually described by a large number of features. Frequently, many of these features are irrelevant for class prediction. The efficient implementation of classification models requires the identification of suitable combinations of features; a smaller number of features reduces the problem's dimensionality and may result in higher classification performance. We developed DWFS, a web-based tool that allows for efficient feature selection for a variety of problems. DWFS follows the wrapper paradigm and applies a search strategy based on Genetic Algorithms (GAs). A parallel GA implementation simultaneously examines and evaluates a large number of candidate feature subsets. DWFS also integrates various filtering methods that may be applied as a pre-processing step in the feature selection process. Furthermore, the weights and parameters in the fitness function of the GA can be adjusted according to the application requirements. Experiments using heterogeneous datasets from different biomedical applications demonstrate that DWFS is fast and leads to a significant reduction in the number of features without sacrificing performance as compared to several widely used existing methods. DWFS can be accessed online at www.cbrc.kaust.edu.sa/dwfs.
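
A minimal sketch of the wrapper-GA idea, assuming a toy dataset: binary chromosomes encode feature subsets, the fitness trades cross-validated accuracy against subset size, and selection, crossover and mutation evolve the population. DWFS's actual GA, filters and fitness weights differ:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 12))        # toy dataset, 12 candidate features
y = rng.integers(0, 2, size=150)

def fitness(mask):
    """Cross-validated accuracy of the wrapped classifier, traded off
    against subset size (weights are illustrative, not DWFS defaults)."""
    if not mask.any():
        return 0.0
    acc = cross_val_score(GaussianNB(), X[:, mask], y, cv=3).mean()
    return acc - 0.01 * mask.sum()

pop = rng.random((20, X.shape[1])) < 0.5      # random binary chromosomes
for _ in range(15):                           # a few GA generations
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-10:]]   # keep the fittest half
    kids = parents[rng.permutation(10)].copy()
    cut = rng.integers(1, X.shape[1])
    kids[:5, :cut] = parents[5:10, :cut]      # one-point crossover
    kids ^= rng.random(kids.shape) < 0.05     # bit-flip mutation
    pop = np.vstack([parents, kids])

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected features:", np.flatnonzero(best))
```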


Subject(s)
Data Mining/methods; Genomics/methods; Software; Databases, Genetic; Selection Bias
10.
Nucleic Acids Res ; 43(1): e6, 2015 Jan.
Article in English | MEDLINE | ID: mdl-25378307

ABSTRACT

Transcription regulation in multicellular eukaryotes is orchestrated by a number of DNA functional elements located in gene regulatory regions. Some regulatory regions (e.g. enhancers) are located far away from the gene they affect. Identification of distal regulatory elements is a challenge for bioinformatics research. Although existing methodologies have increased the number of computationally predicted enhancers, performance inconsistency of computational models across different cell lines, class imbalance within the learning sets and ad hoc rules for selecting enhancer candidates for supervised learning are key issues that require further examination. In this study we developed DEEP, a novel ensemble prediction framework. DEEP integrates three components with diverse characteristics that streamline the analysis of enhancer properties in a great variety of cellular conditions. In our method we train many individual classification models that we combine to classify DNA regions as enhancers or non-enhancers. DEEP uses features derived from histone modification marks or attributes coming from sequence characteristics. Experimental results indicate that DEEP performs better than four state-of-the-art methods on the ENCODE data. We report the first computational enhancer prediction results on FANTOM5 data, where DEEP achieves 90.2% accuracy and a 90% geometric mean (GM) of specificity and sensitivity across 36 different tissues. We further present results derived using in vivo-derived enhancer data from the VISTA database. DEEP-VISTA, when tested on an independent test set, achieved a GM of 80.1% and an accuracy of 89.64%. The DEEP framework is publicly available at http://cbrc.kaust.edu.sa/deep/.
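
A simplified stand-in for the ensemble idea, with synthetic data: one classifier per feature group, probabilities averaged, and the reported geometric mean (GM) of sensitivity and specificity computed from the combined predictions. DEEP's actual components and second-stage combination are more involved:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))        # synthetic region features
y = rng.integers(0, 2, size=300)      # enhancer (1) vs non-enhancer (0)

# One SVM per feature group (e.g. one group per data source); the ensemble
# prediction is the average of the component probabilities.
groups = np.array_split(np.arange(X.shape[1]), 4)
models = [SVC(probability=True, random_state=0).fit(X[:, g], y) for g in groups]
proba = np.mean([m.predict_proba(X[:, g])[:, 1]
                 for m, g in zip(models, groups)], axis=0)
pred = (proba >= 0.5).astype(int)

# Geometric mean of sensitivity and specificity, the metric the paper reports.
tp = np.sum((pred == 1) & (y == 1)); fn = np.sum((pred == 0) & (y == 1))
tn = np.sum((pred == 0) & (y == 0)); fp = np.sum((pred == 1) & (y == 0))
gm = np.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))
print(f"GM of sensitivity and specificity: {gm:.3f}")
```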


Subject(s)
Enhancer Elements, Genetic; Sequence Analysis, DNA/methods; Chromatin Immunoprecipitation/methods; Genomics/methods; HeLa Cells; Histones/metabolism; Humans; K562 Cells; Support Vector Machine; Transcription Factors/metabolism
11.
PLoS One ; 8(9): e75505, 2013.
Article in English | MEDLINE | ID: mdl-24086547

ABSTRACT

A fundamental problem in bioinformatics is genome assembly. Next-generation sequencing (NGS) technologies produce large volumes of fragmented genome reads, which require large amounts of memory to assemble the complete genome efficiently. With recent improvements in DNA sequencing technologies, the memory footprint required for the assembly process is expected to increase dramatically and to emerge as a limiting factor in processing widely available NGS-generated reads. In this report, we compare current memory-efficient techniques for genome assembly with respect to quality, memory consumption and execution time. Our experiments show that it is possible to generate draft assemblies of reasonable quality on conventional multi-purpose computers with very limited available memory by choosing suitable assembly methods. Our study reveals the minimum memory requirements of different assembly programs, even when the data volume exceeds memory capacity by orders of magnitude. By combining existing methodologies, we propose two general assembly strategies that can improve short-read assembly approaches and reduce the memory footprint. Finally, we discuss the possibility of utilizing cloud infrastructures for genome assembly and comment on suitable computational resources for assembly.
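
One of the measurements such a comparison rests on, the peak memory of an assembler run, can be sketched as below. The assembler command and its flags are placeholders for a real tool, and the resource module is Unix-only:

```python
import resource      # Unix-only; reports resource usage of finished processes
import subprocess

# "assembler" and its flags are placeholders for a real assembly tool.
subprocess.run(["assembler", "--reads", "reads.fastq", "--out", "draft.fasta"],
               check=True)

# Peak resident set size across child processes; on Linux ru_maxrss is
# reported in kilobytes (bytes on macOS).
peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
print(f"peak assembler memory: {peak_kb / 1024:.1f} MiB")
```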


Subject(s)
Computational Biology/methods; Genome, Bacterial/genetics; High-Throughput Nucleotide Sequencing/methods; Software