Results 1 - 20 of 66
1.
Nat Commun ; 15(1): 5573, 2024 Jul 02.
Article in English | MEDLINE | ID: mdl-38956036

ABSTRACT

Recent advancements in genome assembly have greatly improved the prospects for comprehensive annotation of Transposable Elements (TEs). However, existing methods for TE annotation using genome assemblies suffer from limited accuracy and robustness, requiring extensive manual editing. In addition, the currently available gold-standard TE databases are not comprehensive, even for extensively studied species, highlighting the critical need for an automated TE detection method to supplement existing repositories. In this study, we introduce HiTE, a fast and accurate dynamic boundary adjustment approach designed to detect full-length TEs. The experimental results demonstrate that HiTE outperforms RepeatModeler2, the state-of-the-art tool, across various species. Furthermore, HiTE has identified numerous novel transposons with well-defined structures containing protein-coding domains, some of which are directly inserted within crucial genes, leading to direct alterations in gene expression. A Nextflow version of HiTE is also available, with enhanced parallelism, reproducibility, and portability.


Subject(s)
DNA Transposable Elements , Molecular Sequence Annotation , DNA Transposable Elements/genetics , Molecular Sequence Annotation/methods , Animals , Software , Humans , Reproducibility of Results , Computational Biology/methods , Databases, Genetic , Algorithms , Genome/genetics
2.
Bioinformatics ; 40(Supplement_1): i511-i520, 2024 Jun 28.
Article in English | MEDLINE | ID: mdl-38940121

ABSTRACT

MOTIVATION: Identifying cancer genes remains a significant challenge in cancer genomics research. Annotated gene sets encode functional associations among multiple genes, and cancer genes have been shown to cluster in hallmark signaling pathways and biological processes. The knowledge of annotated gene sets is critical for discovering cancer genes but remains to be fully exploited. RESULTS: Here, we present the DIsease-Specific Hypergraph neural network (DISHyper), a hypergraph-based computational method that integrates the knowledge from multiple types of annotated gene sets to predict cancer genes. First, our benchmark results demonstrate that DISHyper outperforms the existing state-of-the-art methods and highlight the advantages of employing hypergraphs for representing annotated gene sets. Second, we validate the accuracy of DISHyper-predicted cancer genes using functional validation results and multiple independent functional genomics data. Third, our model predicts 44 novel cancer genes, and subsequent analysis shows their significant associations with multiple types of cancers. Overall, our study provides a new perspective for discovering cancer genes and reveals previously undiscovered cancer genes. AVAILABILITY AND IMPLEMENTATION: DISHyper is freely available for download at https://github.com/genemine/DISHyper.


Subject(s)
Neoplasms , Neural Networks, Computer , Humans , Neoplasms/genetics , Computational Biology/methods , Genomics/methods , Genes, Neoplasm , Molecular Sequence Annotation/methods , Databases, Genetic
3.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38600667

ABSTRACT

Human leukocyte antigen (HLA) recognizes foreign threats and triggers immune responses by presenting peptides to T cells. Computationally modeling the binding patterns between peptides and HLA is very important for the development of tumor vaccines. However, accurately predicting which peptides bind to HLA molecules remains a major challenge. In this paper, we develop a new model, TripHLApan, for predicting peptide binding to HLA molecules by integrating a triple coding matrix, BiGRU + Attention models, and a transfer learning strategy. We have found the main interaction site regions between HLA molecules and peptides, as well as the correlation between HLA encoding and binding motifs. Based on this discovery, we make the preprocessing and coding closer to the natural biological process. In addition, because the input is based on multiple types of features and the attention module focuses on the BiGRU hidden layer, TripHLApan learns more sequence-level binding information. The application of transfer learning strategies ensures the accuracy of prediction results for special lengths (peptides of length 8) and model scalability as data volumes grow. Compared with the current best-performing models, TripHLApan exhibits strong predictive performance in various prediction environments with different positive and negative sample ratios. In addition, we validate the superiority and scalability of TripHLApan's predictive performance using additional recent data sets, ablation experiments and binding reconstitution ability in the samples of a melanoma patient. The results show that TripHLApan is a powerful tool for predicting the binding of HLA-I and HLA-II molecules to peptides for the synthesis of tumor vaccines. TripHLApan is publicly available at https://github.com/CSUBioGroup/TripHLApan.git.


Subject(s)
Cancer Vaccines , Humans , Protein Binding , Peptides/chemistry , HLA Antigens/chemistry , Histocompatibility Antigens Class II/chemistry , Histocompatibility Antigens Class I/chemistry , Machine Learning
4.
Bioinformatics ; 39(9)2023 09 02.
Article in English | MEDLINE | ID: mdl-37606993

ABSTRACT

MOTIVATION: Cancer heterogeneity drastically affects cancer therapeutic outcomes. Predicting drug response in vitro is expected to help formulate personalized therapy regimens. In recent years, several computational models based on machine learning and deep learning have been proposed to predict drug response in vitro. However, most of these methods capture drug features based on a single drug description (e.g. drug structure), without considering the relationships between drugs and biological entities (e.g. target, diseases, and side effects). Moreover, most of these methods collect features separately for drugs and cell lines but fail to consider the pairwise interactions between drugs and cell lines. RESULTS: In this paper, we propose a deep learning framework, named MSDRP for drug response prediction. MSDRP uses an interaction module to capture interactions between drugs and cell lines, and integrates multiple associations/interactions between drugs and biological entities through similarity network fusion algorithms, outperforming some state-of-the-art models in all performance measures for all experiments. The experimental results of de novo test and independent test demonstrate the excellent performance of our model for new drugs. Furthermore, several case studies illustrate the rationality for using feature vectors derived from drug similarity matrices from multisource data to represent drugs and the interpretability of our model. AVAILABILITY AND IMPLEMENTATION: The codes of MSDRP are available at https://github.com/xyzhang-10/MSDRP.


Subject(s)
Deep Learning , Drug-Related Side Effects and Adverse Reactions , Humans , Algorithms , Cell Line , Machine Learning
5.
Commun Biol ; 6(1): 870, 2023 08 24.
Article in English | MEDLINE | ID: mdl-37620651

ABSTRACT

Adverse Drug Reactions (ADRs) have a direct impact on human health. As continuous pharmacovigilance and drug monitoring prove to be costly and time-consuming, computational methods have emerged as promising alternatives. However, most existing computational methods primarily focus on predicting whether or not the drug is associated with an adverse reaction and do not consider the core issue of drug benefit-risk assessment: whether the treatment outcome is serious when adverse drug reactions occur. To this end, we categorize serious clinical outcomes caused by adverse reactions to drugs into seven distinct classes and present a deep learning framework, called GCAP, for predicting the seriousness of clinical outcomes of adverse reactions to drugs. GCAP has two tasks: one is to predict whether adverse reactions to drugs cause serious clinical outcomes, and the other is to infer the corresponding classes of serious clinical outcomes. Experimental results demonstrate that our method is a powerful and robust framework with high extensibility. GCAP can serve as a useful tool to address the challenge of predicting the seriousness of clinical outcomes stemming from adverse reactions to drugs.


Subject(s)
Deep Learning , Drug-Related Side Effects and Adverse Reactions , Humans , Drug-Related Side Effects and Adverse Reactions/diagnosis , Drug-Related Side Effects and Adverse Reactions/epidemiology , Drug-Related Side Effects and Adverse Reactions/etiology , Pancreas
6.
Bioinformatics ; 39(39 Suppl 1): i368-i376, 2023 06 30.
Article in English | MEDLINE | ID: mdl-37387178

ABSTRACT

MOTIVATION: Single-cell RNA sequencing (scRNA-seq) offers a powerful tool to dissect the complexity of biological tissues through cell sub-population identification in combination with clustering approaches. Feature selection is a critical step for improving the accuracy and interpretability of single-cell clustering. Existing feature selection methods underutilize the discriminatory potential of genes across distinct cell types. We hypothesize that incorporating such information could further boost the performance of single cell clustering. RESULTS: We develop CellBRF, a feature selection method that considers genes' relevance to cell types for single-cell clustering. The key idea is to identify genes that are most important for discriminating cell types through random forests guided by predicted cell labels. Moreover, it proposes a class balancing strategy to mitigate the impact of unbalanced cell type distributions on feature importance evaluation. We benchmark CellBRF on 33 scRNA-seq datasets representing diverse biological scenarios and demonstrate that it substantially outperforms state-of-the-art feature selection methods in terms of clustering accuracy and cell neighborhood consistency. Furthermore, we demonstrate the outstanding performance of our selected features through three case studies on cell differentiation stage identification, non-malignant cell subtype identification, and rare cell identification. CellBRF provides a new and effective tool to boost single-cell clustering accuracy. AVAILABILITY AND IMPLEMENTATION: All source codes of CellBRF are freely available at https://github.com/xuyp-csu/CellBRF.


Subject(s)
Benchmarking , Random Forest , Cell Differentiation , Cluster Analysis
7.
J Biomed Inform ; 143: 104396, 2023 07.
Article in English | MEDLINE | ID: mdl-37211195

ABSTRACT

Automated ICD coding is a multi-label prediction task that aims to assign each patient's diagnoses the most relevant subset of disease codes. In the deep learning regime, recent works have suffered from large label sets and heavily imbalanced label distributions. To mitigate the negative effect in such scenarios, we propose a retrieve-and-rerank framework that introduces Contrastive Learning (CL) for label retrieval, allowing the model to make more accurate predictions from a simplified label space. Given the appealing discriminative power of CL, we adopt it as the training strategy to replace the standard cross-entropy objective and retrieve a small subset by taking the distance between clinical notes and ICD codes into account. After proper training, the retriever can implicitly capture code co-occurrence, which makes up for the deficiency of cross-entropy, which assigns each label independently of the others. Further, we develop a powerful model based on a Transformer variant for refining and reranking the candidate set, which can extract semantically meaningful features from long clinical sequences. When our method is applied to well-known models, experiments show that our framework provides more accurate results by preselecting a small subset of candidates before fine-level reranking. Relying on the framework, our proposed model achieves 0.590 and 0.990 in terms of Micro-F1 and Micro-AUC on the MIMIC-III benchmark.


Subject(s)
Electronic Health Records , International Classification of Diseases , Humans
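
The Micro-F1 score reported in the abstract above pools true positives, false positives, and false negatives over all notes and all codes before computing precision and recall. A minimal sketch of that calculation follows, assuming each note's labels are given as a set of code strings; the function name `micro_f1` and the example codes are illustrative and are not taken from the paper.

```python
from typing import List, Set


def micro_f1(true_labels: List[Set[str]], pred_labels: List[Set[str]]) -> float:
    """Micro-averaged F1 for multi-label predictions (one label set per note)."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, pred_labels):
        tp += len(truth & pred)   # codes predicted and correct
        fp += len(pred - truth)   # codes predicted but wrong
        fn += len(truth - pred)   # codes missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative labels only (hypothetical notes, not MIMIC-III data).
truth = [{"401.9", "428.0"}, {"250.00"}]
pred = [{"401.9"}, {"250.00", "272.4"}]
score = micro_f1(truth, pred)
```

Because counts are pooled, frequent codes dominate the score, which is one reason imbalanced label distributions matter for this task.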
8.
Bioinformatics ; 39(5)2023 05 04.
Article in English | MEDLINE | ID: mdl-37084258

ABSTRACT

MOTIVATION: Hi-C technology has been the most widely used chromosome conformation capture (3C) experiment that measures the frequency of all paired interactions in the entire genome, which is a powerful tool for studying the 3D structure of the genome. The fineness of the constructed genome structure depends on the resolution of Hi-C data. However, because high-resolution Hi-C data require deep sequencing and thus high experimental cost, most available Hi-C data are low resolution. Hence, it is essential to enhance the quality of Hi-C data by developing effective computational methods. RESULTS: In this work, we propose a novel method, called DFHiC, which generates the high-resolution Hi-C matrix from the low-resolution Hi-C matrix in the framework of the dilated convolutional neural network. The dilated convolution is able to effectively explore the global patterns in the overall Hi-C matrix by drawing on information in the Hi-C matrix across longer genomic distances. Consequently, DFHiC can improve the resolution of the Hi-C matrix reliably and accurately. More importantly, the super-resolution Hi-C data enhanced by DFHiC are more in line with the real high-resolution Hi-C data than data produced by other existing methods, in terms of both significant chromatin interactions and the identification of topologically associating domains. AVAILABILITY AND IMPLEMENTATION: https://github.com/BinWangCSU/DFHiC.


Subject(s)
Chromatin , Chromosomes , Chromatin/genetics , Genome , Genomics , Neural Networks, Computer
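
To illustrate why a dilated convolution can draw on longer genomic distances, here is a minimal one-dimensional sketch. It is an assumption-laden simplification: DFHiC operates on two-dimensional Hi-C matrices inside a neural network, whereas this example applies a fixed kernel to a short list. The function name `dilated_conv1d` is hypothetical.

```python
from typing import List


def dilated_conv1d(signal: List[float], kernel: List[float], dilation: int) -> List[float]:
    """Valid (unpadded) 1D dilated cross-correlation."""
    # Distance covered by one output value: grows with dilation, not kernel size.
    span = (len(kernel) - 1) * dilation + 1
    out = []
    for start in range(len(signal) - span + 1):
        out.append(sum(w * signal[start + i * dilation] for i, w in enumerate(kernel)))
    return out


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
kernel = [1.0, 1.0, 1.0]
dense = dilated_conv1d(signal, kernel, dilation=1)  # each output sees 3 adjacent bins
wide = dilated_conv1d(signal, kernel, dilation=2)   # each output sees bins spread over 5
```

With the same three weights, the dilated version covers a span of five positions instead of three, so stacked dilated layers reach distant bins without extra parameters.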
9.
IEEE/ACM Trans Comput Biol Bioinform ; 20(5): 2712-2723, 2023.
Article in English | MEDLINE | ID: mdl-34110998

ABSTRACT

The Anatomical Therapeutic Chemical (ATC) classification system, designated by the World Health Organization Collaborating Center (WHOCC), has been widely used in drug screening, repositioning, and similarity research. The ATC classification system assigns different codes to drugs according to the organ or system on which they act and/or their therapeutic and chemical characteristics. Correctly identifying the potential ATC codes for drugs can accelerate drug development and reduce the cost of experiments. Several classifiers have been proposed in this regard. However, they lack the ability to learn basic features from sparsely known drug-ATC code associations. Therefore, there is an urgent need for novel computational methods to precisely predict potential drug-ATC code associations at multiple levels of the ATC classification system based on known associations between drugs and ATC codes. In this paper, we provide a novel end-to-end model, called RNPredATC, to predict potential drug-ATC code associations at five ATC classification levels. RNPredATC can extract dense feature vectors from sparsely known drug-ATC code associations and reduce the impact of the degradation problem through a novel deep residual learning approach. We extensively compare our method with some state-of-the-art methods, including NetPredATC, SPACE, and some multi-label-based methods. Our experimental results show that RNPredATC achieves better performance in five-fold and ten-fold cross-validation. Furthermore, the visualization analysis of hidden layers and case studies of predicted associations at the fifth ATC classification level confirm that RNPredATC can effectively identify the potential ATC codes of drugs.

10.
Article in English | MEDLINE | ID: mdl-35476573

ABSTRACT

The understanding of protein functions is critical to many biological problems such as the development of new drugs and new crops. To reduce the huge gap between the increase of protein sequences and annotations of protein functions, many methods have been proposed to deal with this problem. These methods use Gene Ontology (GO) to classify the functions of proteins and consider one GO term as a class label. However, they ignore the co-occurrence of GO terms that is helpful for protein function prediction. We propose a new deep learning model, named DeepPFP-CO, which uses Graph Convolutional Network (GCN) to explore and capture the co-occurrence of GO terms to improve the protein function prediction performance. In this way, we can further deduce the protein functions by fusing the predicted propensity of the center function and its co-occurrence functions. We use Fmax and AUPR to evaluate the performance of DeepPFP-CO and compare DeepPFP-CO with state-of-the-art methods such as DeepGOPlus and DeepGOA. The computational results show that DeepPFP-CO outperforms DeepGOPlus and other methods. Moreover, we further analyze our model at the protein level. The results have demonstrated that DeepPFP-CO improves the performance of protein function prediction. DeepPFP-CO is available at https://csuligroup.com/DeepPFP/.


Subject(s)
Deep Learning , Gene Ontology , Proteins/genetics , Amino Acid Sequence
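
The Fmax metric mentioned in the abstract above is commonly computed in a protein-centric way: for each score threshold, precision is averaged over proteins with at least one prediction, recall is averaged over all proteins, and the best resulting F-measure is reported. The sketch below follows that common definition; whether DeepPFP-CO's evaluation matches it in every detail is an assumption, and the function name, term names, and scores are illustrative.

```python
from typing import Dict, List, Set


def fmax(truth: Dict[str, Set[str]],
         scores: Dict[str, Dict[str, float]],
         thresholds: List[float]) -> float:
    """Protein-centric Fmax; assumes every protein has at least one true term."""
    best = 0.0
    for t in thresholds:
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in truth.items():
            predicted = {term for term, s in scores.get(protein, {}).items() if s >= t}
            hits = len(predicted & true_terms)
            if predicted:  # precision only counts proteins with predictions
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue
        pr = sum(precisions) / len(precisions)
        rc = recall_sum / len(truth)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best


# Hypothetical proteins, terms, and scores for illustration.
truth = {"P1": {"termA", "termB"}, "P2": {"termC"}}
scores = {
    "P1": {"termA": 0.9, "termB": 0.4, "termC": 0.2},
    "P2": {"termC": 0.6, "termA": 0.5},
}
result = fmax(truth, scores, thresholds=[0.1, 0.3, 0.5, 0.7])
```

In this toy case the best threshold is 0.3, where precision is 0.75 and recall is 1.0.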
11.
Article in English | MEDLINE | ID: mdl-35104223

ABSTRACT

Topologically associating domains (TADs) are local chromatin interaction domains, which have been shown to play an important role in gene expression regulation. TADs were originally discovered in the investigation of 3D genome organization based on High-throughput Chromosome Conformation Capture (Hi-C) data. Considerable and continuing efforts have been dedicated to developing methods for detecting TADs from Hi-C data. Different computational methods for TAD identification vary in their assumptions and criteria for calling TADs. As a consequence, the TADs called by these methods differ in their similarities and in the biological features they are enriched in. In this work, we performed a systematic comparison of twenty-six TAD callers. We first compared the TADs and gaps between adjacent TADs across different methods, resolutions, and sequencing depths. We then assessed the quality of TADs and TAD boundaries according to three criteria: the decay of contact frequencies over the genomic distance, enrichment and depletion of regulatory elements around TAD boundaries, and reproducibility of TADs and TAD boundaries in replicate samples. Last, due to the lack of a gold standard for TADs, we also evaluated the performance of the methods on synthetic datasets. We discussed the key principles of TAD callers, and summarized the current state of TAD detection. We provide a concise, comprehensive, and systematic framework for evaluating the performance of TAD callers, and expect our work will provide useful guidance in choosing suitable approaches for the detection and evaluation of TADs.


Subject(s)
Chromatin , Chromosomes , Reproducibility of Results , Chromatin/genetics , Chromosomes/genetics , Genome , Gene Expression Regulation
12.
Article in English | MEDLINE | ID: mdl-35471889

ABSTRACT

The identification of drug-target relations (DTRs) is essential in drug development. A large number of methods treat DTRs as drug-target interactions (DTIs), a binary classification problem. The main drawbacks of these methods are the lack of reliable negative samples and the absence of many important aspects of DTRs, including their dose dependence and quantitative affinities. With the recent increase in published drug-protein binding affinity data, DTR prediction can be viewed as a regression problem of drug-target affinities (DTAs), which reflect how tightly the drug binds to the target and can present more detailed and specific information than DTIs. The growth of affinity data enables the use of deep learning architectures, which have been shown to be among the state-of-the-art methods in binding affinity prediction. Although relatively effective, these models are less biologically interpretable because of the black-box nature of deep learning. In this study, we proposed a deep learning-based model, named AttentionDTA, which uses an attention mechanism to predict DTAs. Unlike models that use 3D structures of drug-target complexes or graph representations of drugs and proteins, the novelty of our work is to use an attention mechanism to focus on key subsequences which are important in drug and protein sequences when predicting their affinity. We use two separate one-dimensional Convolutional Neural Networks (1D-CNNs) to extract the semantic information of the drug's SMILES string and the protein's amino acid sequence. Furthermore, a two-side multi-head attention mechanism is developed and embedded in our model to explore the relationship between drug features and protein features. We evaluate our model on three established DTA benchmark datasets, Davis, Metz, and KIBA. AttentionDTA outperforms the state-of-the-art deep learning methods under different evaluation metrics. The results show that the attention-based model can effectively extract protein features related to drug information and drug features related to protein information to better predict drug-target affinities. It is worth mentioning that we test our model on an IC50 dataset, which provides the binding sites between drugs and proteins, to evaluate the ability of our model to locate binding sites. Finally, we visualize the attention weights to demonstrate the biological significance of the model. The source code of AttentionDTA can be downloaded from https://github.com/zhaoqichang/AttentionDTA_TCBB.


Subject(s)
Deep Learning , Drug Development , Binding Sites , Amino Acid Sequence , Benchmarking
13.
Brief Bioinform ; 24(1)2023 01 19.
Article in English | MEDLINE | ID: mdl-36511222

ABSTRACT

Circular RNAs (circRNAs) are reverse-spliced and covalently closed RNAs. Their interactions with RNA-binding proteins (RBPs) have multiple effects on the progress of many diseases. Some computational methods are proposed to identify RBP binding sites on circRNAs but suffer from insufficient accuracy, robustness and explanation. In this study, we first take the characteristics of both RNA and RBP into consideration. We propose a method for discriminating circRNA-RBP binding sites based on multi-scale characterizing sequence and structure features, called CRMSS. For circRNAs, we use sequence k-mer embedding and the forming probabilities of local secondary structures as features. For RBPs, we combine sequence and structure frequencies of RNA-binding domain regions to generate features. We capture binding patterns with multi-scale residual blocks. With BiLSTM and attention mechanism, we obtain the contextual information of high-level representation for circRNA-RBP binding. To validate the effectiveness of CRMSS, we compare its predictive performance with other methods on 37 RBPs. Taking the properties of both circRNAs and RBPs into account, CRMSS achieves superior performance over state-of-the-art methods. In the case study, our model provides reliable predictions and correctly identifies experimentally verified circRNA-RBP pairs. The code of CRMSS is freely available at https://github.com/BioinformaticsCSU/CRMSS.


Subject(s)
RNA, Circular , RNA , RNA, Circular/genetics , Binding Sites , RNA/metabolism , RNA-Binding Proteins/metabolism
14.
IEEE/ACM Trans Comput Biol Bioinform ; 20(3): 1943-1952, 2023.
Article in English | MEDLINE | ID: mdl-36445997

ABSTRACT

Drug discovery and drug repurposing often rely on the successful prediction of drug-target interactions (DTIs). Recent advances have shown great promise in applying deep learning to drug-target interaction prediction. One challenge in building deep learning-based models is to adequately represent drugs and proteins in a way that encompasses the fundamental local chemical environments and long-distance information among amino acids of proteins (or atoms of drugs). Another challenge is to efficiently model the intermolecular interactions between drugs and proteins, which play vital roles in DTIs. To this end, we propose a novel model, GIFDTI, which consists of three key components: the sequence feature extractor (CNNFormer), the global molecular feature extractor (GF), and the intermolecular interaction modeling module (IIF). Specifically, CNNFormer incorporates CNN and Transformer to capture the local patterns and encode the long-distance relationship among tokens (atoms or amino acids) in a sequence. Then, GF and IIF extract the global molecular features and the intermolecular interaction features, respectively. We evaluate GIFDTI under six realistic evaluation strategies and the results show it improves DTI prediction performance compared to state-of-the-art methods. Moreover, case studies confirm that our model can be a useful tool to accurately yield low-cost DTIs. The code of GIFDTI is available at https://github.com/zhaoqichang/GIFDTI.


Subject(s)
Drug Development , Proteins , Proteins/chemistry , Drug Development/methods , Drug Discovery/methods , Drug Repositioning , Amino Acids
15.
Bioinformatics ; 38(17): 4153-4161, 2022 09 02.
Article in English | MEDLINE | ID: mdl-35801934

ABSTRACT

MOTIVATION: Identifying drug-target interactions is a crucial step for drug discovery and design. Traditional biochemical experiments can credibly and accurately validate drug-target interactions. However, they are also extremely laborious, time-consuming and expensive. With the collection of more validated biomedical data and the advancement of computing technology, computational methods based on chemogenomics, which guide experimental verification, are gradually attracting more attention. RESULTS: In this study, we propose an end-to-end deep learning-based method named IIFDTI to predict drug-target interactions (DTIs) based on independent features of drug-target pairs and interactive features of their substructures. First, the interactive features of substructures between drugs and targets are extracted by the bidirectional encoder-decoder architecture. The independent features of drugs and targets are extracted by the graph neural networks and convolutional neural networks, respectively. Then, all extracted features are fused and fed into fully connected dense layers in downstream tasks for predicting DTIs. IIFDTI takes into account the independent features of drugs/targets and simulates the interactive features of the substructures from the biological perspective. Multiple experiments show that IIFDTI outperforms the state-of-the-art methods in terms of the area under the receiver operating characteristic curve (AUC), the area under the precision-recall curve (AUPR), precision, and recall on benchmark datasets. In addition, the mapped visualizations of attention weights indicate that IIFDTI has learned biological knowledge, and two case studies illustrate the capabilities of IIFDTI in practical applications. AVAILABILITY AND IMPLEMENTATION: The data and codes underlying this article are available on GitHub at https://github.com/czjczj/IIFDTI. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Drug Discovery , Neural Networks, Computer , Drug Interactions , Area Under Curve , Drug Discovery/methods , ROC Curve
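
The AUC used to compare DTI predictors in the abstract above can be read as the probability that a randomly chosen interacting pair is scored higher than a randomly chosen non-interacting pair. A minimal sketch of that pairwise definition follows; it is a quadratic-time illustration rather than the evaluation code of any listed paper, and the labels and scores are made up.

```python
from typing import List


def roc_auc(labels: List[int], scores: List[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical predictions: 1 = interacting pair, 0 = non-interacting pair.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
```

Here three of the six positive-negative pairs are ordered correctly, so the score equals that of a random ranking.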
16.
IEEE J Biomed Health Inform ; 26(10): 5201-5212, 2022 10.
Article in English | MEDLINE | ID: mdl-35867367

ABSTRACT

Automatic International Classification of Diseases (ICD) coding is defined as a kind of text multi-label classification problem, which is difficult because the number of labels is very large and the distribution of labels is unbalanced. The label-wise attention mechanism is widely used in automatic ICD coding because it can assign weights to every word in full Electronic Medical Records (EMR) for different ICD codes. However, the label-wise attention mechanism is redundant and computationally costly. In this paper, we propose a pseudo label-wise attention mechanism to tackle the problem. Instead of computing different attention modes for different ICD codes, the pseudo label-wise attention mechanism automatically merges similar ICD codes and computes only one attention mode for the similar ICD codes, which greatly reduces the number of attention modes and improves prediction accuracy. In addition, we apply a more convenient and effective way to obtain the ICD vectors, and thus our model can predict new ICD codes by calculating the similarities between EMR vectors and ICD vectors. Our model demonstrates effectiveness in extensive computational experiments. On the public MIMIC-III dataset and private Xiangya dataset, our model achieves the best performance on micro F1 (0.583 and 0.806), micro AUC (0.986 and 0.994), P@8 (0.756 and 0.413), and uses much less GPU memory (about 26.1% of that used by models with label-wise attention). Furthermore, we verify the ability of our model to predict new ICD codes. The interpretability analysis and case study show the effectiveness and reliability of the patterns obtained by the pseudo label-wise attention mechanism.


Subject(s)
Electronic Health Records , International Classification of Diseases , Clinical Coding , Humans , Reproducibility of Results
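
The abstract above says new ICD codes can be predicted by calculating similarities between EMR vectors and ICD vectors. The sketch below shows one way such a ranking could work, assuming cosine similarity as the measure (the abstract does not state which similarity is used). The function names, code labels, and three-dimensional vectors are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_codes(emr_vector: List[float],
               icd_vectors: Dict[str, List[float]],
               top_k: int) -> List[Tuple[str, float]]:
    """Return the top_k codes whose vectors are most similar to the EMR vector."""
    scored = [(code, cosine(emr_vector, vec)) for code, vec in icd_vectors.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical embeddings; "code_new" stands for a code unseen during training.
icd_vectors = {
    "code_a": [1.0, 0.0, 0.0],
    "code_b": [0.0, 1.0, 0.0],
    "code_new": [0.7, 0.7, 0.0],
}
emr = [0.6, 0.8, 0.0]
top = rank_codes(emr, icd_vectors, top_k=2)
```

Because ranking only needs a vector for each code, a code added after training can be scored as soon as its vector is available.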
17.
Bioinformatics ; 38(7): 1995-2002, 2022 03 28.
Article in English | MEDLINE | ID: mdl-35043942

ABSTRACT

MOTIVATION: The identification of compound-protein interactions (CPIs) is an essential step in the process of drug discovery. The experimental determination of CPIs is known to consume large amounts of funds and time. Computational models have therefore become a promising and efficient alternative for predicting novel interactions between compounds and proteins on a large scale. Most supervised machine learning prediction models treat the task as a binary classification problem, aiming to predict whether or not there is an interaction between the compound and the protein. However, CPI is not a simple binary on-off relationship, but a continuous value that reflects how tightly the compound binds to a particular target protein, also called binding affinity. RESULTS: In this study, we propose an end-to-end neural network model, called BACPI, to predict CPI and binding affinity. We employ a graph attention network and a convolutional neural network (CNN) to learn the representations of compounds and proteins and develop a bi-directional attention neural network model to integrate the representations. To evaluate the performance of BACPI, we use three CPI datasets and four binding affinity datasets in our experiments. The results show that, when predicting CPIs, BACPI significantly outperforms other available machine learning methods on both balanced and unbalanced datasets. This suggests that the end-to-end neural network model that predicts CPIs directly from low-level representations is more robust than traditional machine learning-based methods. When predicting binding affinities, BACPI achieves higher performance on large datasets compared to other state-of-the-art deep learning methods. This comparison suggests that the proposed method with a bi-directional attention neural network can capture the important regions of compounds and proteins for binding affinity prediction. AVAILABILITY AND IMPLEMENTATION: Data and source codes are available at https://github.com/CSUBioGroup/BACPI.


Subject(s)
Neural Networks, Computer , Software , Proteins/chemistry , Machine Learning , Drug Discovery/methods
18.
IEEE/ACM Trans Comput Biol Bioinform ; 19(4): 2092-2110, 2022.
Article in English | MEDLINE | ID: mdl-33769935

ABSTRACT

The identification of compound-protein relations (CPRs), which includes compound-protein interactions (CPIs) and compound-protein affinities (CPAs), is critical to drug development. A common method for compound-protein relation identification is the use of in vitro screening experiments. However, the number of compounds and proteins is massive, and in vitro screening experiments are labor-intensive, expensive, and time-consuming with high failure rates. Researchers have developed a computational field called virtual screening (VS) to aid experimental drug development. These methods utilize experimentally validated biological interaction information to generate datasets and use the physicochemical and structural properties of compounds and target proteins as input information to train computational prediction models. At present, deep learning has been widely used in computer vision and natural language processing and has experienced epoch-making progress. At the same time, deep learning has also been used in the field of biomedicine widely, and the prediction of CPRs based on deep learning has developed rapidly and has achieved good results. The purpose of this study is to investigate and discuss the latest applications of deep learning techniques in CPR prediction. First, we describe the datasets and feature engineering (i.e., compound and protein representations and descriptors) commonly used in CPR prediction methods. Then, we review and classify recent deep learning approaches in CPR prediction. Next, a comprehensive comparison is performed to demonstrate the prediction performance of representative methods on classical datasets. Finally, we discuss the current state of the field, including the existing challenges and our proposed future directions. We believe that this investigation will provide sufficient references and insight for researchers to understand and develop new deep learning methods to enhance CPR predictions.


Subject(s)
Deep Learning , Proteins , Computer Simulation , Proteins/chemistry
19.
IEEE/ACM Trans Comput Biol Bioinform ; 19(6): 3263-3271, 2022.
Article in English | MEDLINE | ID: mdl-34699365

ABSTRACT

Essential proteins are considered the foundation of life as they are indispensable for the survival of living organisms. Computational methods for essential protein discovery provide a fast way to identify essential proteins. But most of them heavily rely on various biological information, especially protein-protein interaction networks, which limits their practical applications. With the rapid development of high-throughput sequencing technology, sequencing data has become the most accessible biological data. However, using only protein sequence information to predict essential proteins has limited accuracy. In this paper, we propose EP-EDL, an ensemble deep learning model using only protein sequence information to predict human essential proteins. EP-EDL integrates multiple classifiers to alleviate the class imbalance problem and to improve prediction accuracy and robustness. In each base classifier, we employ multi-scale text convolutional neural networks to extract useful features from protein sequence feature matrices with evolutionary information. Our computational results show that EP-EDL outperforms the state-of-the-art sequence-based methods. Furthermore, EP-EDL provides a more practical and flexible way for biologists to accurately predict essential proteins. The source code and datasets can be downloaded from https://github.com/CSUBioGroup/EP-EDL.


Subject(s)
Deep Learning , Humans , Neural Networks, Computer , Proteins/genetics , Amino Acid Sequence , Software
20.
Brief Bioinform ; 22(6), 2021 Nov 05.
Article in English | MEDLINE | ID: mdl-34213525

ABSTRACT

Identifying the frequencies of drug side effects is an important issue in pharmacological studies and drug risk-benefit assessment. However, designing clinical trials to determine these frequencies is usually time-consuming and expensive, and most existing methods can only predict the existence of drug-side effect associations, not their frequencies. Inspired by the recent progress of graph neural networks in recommender systems, we develop a novel prediction model for drug-side effect frequencies, using a graph attention network to integrate three different types of features: similarity information, known drug-side effect frequency information and word embeddings. In comparison, the few available studies focusing on frequency prediction use only the known drug-side effect frequency scores. One novel approach used in this work first decomposes the feature types in the drug-side effect graph to extract different view representation vectors based on the three feature types, and then recombines these latent view vectors automatically to obtain unified embeddings for prediction. The proposed method demonstrates high effectiveness in 10-fold cross-validation. The computational results show that the proposed method achieves the best performance on the benchmark dataset, outperforming the state-of-the-art matrix decomposition model. In addition, ablation experiments and visual analyses are supplied to illustrate the usefulness of our method for predicting drug-side effect frequencies. The code of MGPred is available at https://github.com/zhc940702/MGPred and https://zenodo.org/record/4449613.


Subject(s)
Drug-Related Side Effects and Adverse Reactions/diagnosis , Medical Informatics/methods , Software , Algorithms , Benchmarking , Databases, Factual , Deep Learning , Drug Interactions , Drug-Related Side Effects and Adverse Reactions/etiology , Humans , Reproducibility of Results
...