1.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38725156

ABSTRACT

Protein acetylation is one of the most extensively studied post-translational modifications (PTMs) due to its significant roles across a myriad of biological processes. Although many computational tools for acetylation site identification have been developed, there is a lack of benchmark datasets and bespoke predictors for non-histone acetylation site prediction. To address these problems, we have contributed to both dataset creation and predictor benchmarking in this study. First, we construct a non-histone acetylation site benchmark dataset, namely NHAC, which includes 11 subsets according to the sequence length ranging from 11 to 61 amino acids. In total, there are 886 positive samples and 4707 negative samples for each sequence length. Second, we propose TransPTM, a transformer-based neural network model for non-histone acetylation site prediction. During the data representation phase, per-residue contextualized embeddings are extracted using ProtT5 (an existing pre-trained protein language model). This is followed by the implementation of a graph neural network framework, which consists of three TransformerConv layers for feature extraction and a multilayer perceptron module for classification. The benchmark results show that TransPTM achieves competitive performance for non-histone acetylation site prediction compared with three state-of-the-art tools. It improves our comprehension of the PTM mechanism and provides a theoretical basis for developing drug targets for diseases. Moreover, the created PTM dataset fills the gap in non-histone acetylation site datasets and is beneficial to the related communities. The related source code and data utilized by TransPTM are accessible at https://www.github.com/TransPTM/TransPTM.
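The fixed-length samples described in this abstract can be illustrated with a minimal sketch of cutting a sequence window around each lysine (K), the residue that undergoes acetylation. The function name `extract_windows`, the padding character `X`, and the requirement that the window length be odd are assumptions made for illustration only; the abstract does not state how NHAC windows were actually built, and the published construction may differ.

```python
def extract_windows(sequence, window_length, pad_char="X"):
    """Return (1-based position, window) pairs for every lysine (K) in `sequence`.

    Assumption of this sketch: window_length is odd so the lysine sits exactly
    at the centre, and windows that run past either end are padded with pad_char.
    """
    if window_length < 1 or window_length % 2 == 0:
        raise ValueError("window_length must be a positive odd number")
    half = window_length // 2
    # Pad both ends so that lysines near the termini still yield full-length windows.
    padded = pad_char * half + sequence + pad_char * half
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # Residue i sits at index i + half in `padded`, so the window
            # centred on it spans padded[i : i + window_length].
            windows.append((i + 1, padded[i:i + window_length]))
    return windows
```

Calling `extract_windows(protein, 11)` would give one candidate sample per lysine; labelling each window as positive or negative would require experimental annotations, which this sketch does not cover.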


Subject(s)
Neural Networks, Computer , Protein Processing, Post-Translational , Acetylation , Computational Biology/methods , Databases, Protein , Software , Algorithms , Humans , Proteins/chemistry , Proteins/metabolism
2.
Comput Biol Med ; 175: 108487, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38653064

ABSTRACT

Drug repurposing is promising in multiple scenarios, such as emerging viral outbreak control and cost reduction in drug discovery. Traditional graph-based drug repurposing methods are of limited use for fast, large-scale virtual screens, as they constrain the counts for drugs and targets and fail to predict novel viruses or drugs. Moreover, though deep learning has been proposed for drug repurposing, only a few methods have been used, including a group of pre-trained deep learning models for embedding generation and transfer learning. Hence, we propose DeepSeq2Drug to tackle the shortcomings of previous methods. We leverage multi-modal embeddings and an ensemble strategy to complement the numbers of drugs and viruses and to guarantee novel predictions. This framework (including the expanded version) involves four modal types: six NLP models, four CV models, four graph models, and two sequence models. In detail, we first build a pipeline and calculate the predictive performance of each pair of viral and drug embeddings. Then, we select the best embedding pairs and apply an ensemble strategy to conduct anti-viral drug repurposing. To validate the effect of the proposed ensemble model, a monkeypox virus (MPV) case study is conducted to reflect the potential predictive capability. This framework could be a benchmark method for further pre-trained deep learning optimization and anti-viral drug repurposing tasks. We have also built software to make the proposed model easier to reuse. The code and software are freely available at http://deepseq2drug.cs.cityu.edu.hk.
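The two-step procedure in this abstract (score every viral/drug embedding pair, then ensemble the best ones) might look roughly like the sketch below. Everything here is hypothetical: the function names, the use of a single validation score per pair, and plain averaging as the ensemble rule are assumptions, since the abstract does not specify the selection criterion or the ensemble strategy used by DeepSeq2Drug.

```python
def select_top_pairs(pair_scores, n):
    """Return the n (virus_embedding, drug_embedding) pairs with the highest score.

    pair_scores maps a pair of embedding names to a validation metric
    (which metric is used is an assumption left to the caller).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return sorted(pair_scores, key=pair_scores.get, reverse=True)[:n]


def ensemble_scores(pair_predictions, selected_pairs):
    """Average each drug's predicted score over the selected embedding pairs.

    pair_predictions maps an embedding pair to {drug_name: predicted_score}.
    Simple averaging is an illustrative choice, not the published strategy.
    """
    if not selected_pairs:
        raise ValueError("no embedding pairs selected")
    drugs = pair_predictions[selected_pairs[0]].keys()
    return {
        drug: sum(pair_predictions[pair][drug] for pair in selected_pairs)
        / len(selected_pairs)
        for drug in drugs
    }
```

Sorting the averaged scores in descending order would then give a ranked list of repurposing candidates for a given virus.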


Subject(s)
Antiviral Agents , Deep Learning , Drug Repositioning , Drug Repositioning/methods , Antiviral Agents/pharmacology , Antiviral Agents/therapeutic use , Humans , Software , Benchmarking
3.
iScience ; 27(4): 109352, 2024 Apr 19.
Article in English | MEDLINE | ID: mdl-38510148

ABSTRACT

Gene regulatory networks (GRNs) involve complex and multi-layer regulatory interactions between regulators and their target genes. Precise knowledge of GRNs is important in understanding cellular processes and molecular functions. Recent breakthroughs in single-cell sequencing technology have made it possible to infer GRNs at the single-cell level. Existing methods, however, are limited by expensive computations and sometimes simplistic assumptions. To overcome these obstacles, we propose scGREAT, a framework to infer GRNs using gene embeddings and a transformer from single-cell transcriptomics. scGREAT starts by constructing gene expression and gene biotext dictionaries from scRNA-seq data and gene text information. The representation of TF gene pairs is learned by optimizing the embedding space with a transformer-based engine. Results show that scGREAT outperformed other contemporary methods on benchmarks. Besides, gene representations from scGREAT provide valuable gene regulation insights, and external validation on spatial transcriptomics illuminated the mechanism behind scGREAT annotation. Moreover, scGREAT identified several TF target regulations corroborated in studies.

4.
iScience ; 26(11): 108197, 2023 Nov 17.
Article in English | MEDLINE | ID: mdl-37965148

ABSTRACT

By soaking up microRNAs (miRNAs), long non-coding RNAs (lncRNAs) have the potential to regulate gene expression. Few methods have been developed based on this mechanism to predict lncRNA-gene relationships. Hence, we present lncRNA-Top to forecast potential lncRNA-gene regulation relationships. Specifically, we constructed controlled deep-learning methods using 12417 lncRNAs and 16127 genes. We have provided retrospective and innovative views on negative sampling, random seeds, cross-validation, metrics, and independent datasets. The AUC, AUPR, and our defined precision@k were used to evaluate performance. In-depth case studies demonstrate that 47 of the top 100 predicted unknown pairings were recorded in publications, supporting the predictive power. Our additional software can annotate the scores with target candidates. The lncRNA-Top will be a helpful tool to uncover prospective lncRNA targets and better comprehend the regulatory processes of lncRNAs.
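As a point of reference for the precision@k metric mentioned in this abstract, the sketch below implements the conventional definition: the fraction of the k highest-ranked predictions that are confirmed. The abstract refers to "our defined precision@k", so the authors' variant may differ; the function name and the tuple representation of lncRNA-gene pairs are assumptions.

```python
def precision_at_k(ranked_pairs, confirmed_pairs, k):
    """Fraction of the top-k ranked pairs that appear in confirmed_pairs.

    ranked_pairs: list of (lncRNA, gene) tuples, best prediction first.
    confirmed_pairs: set of (lncRNA, gene) tuples known to be true.
    This is the conventional definition, assumed here for illustration.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_pairs[:k]
    hits = sum(1 for pair in top_k if pair in confirmed_pairs)
    # Divide by k (not len(top_k)) so short rankings are not rewarded.
    return hits / k
```

Under this definition, 47 confirmed pairs among the top 100 predictions would correspond to a precision@100 of 0.47.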

5.
Adv Sci (Weinh) ; 10(33): e2303502, 2023 11.
Article in English | MEDLINE | ID: mdl-37816141

ABSTRACT

Single-cell Hi-C (scHi-C) has made it possible to analyze chromatin organization at the single-cell level. However, scHi-C experiments generate inherently sparse data, which poses a challenge for loop calling methods. The existing approach performs significance tests across the imputed dense contact maps, leading to substantial computational overhead and loss of information at the single-cell level. To overcome this limitation, a lightweight framework called scGSLoop is proposed, which sets a new paradigm for scHi-C loop calling by adapting the training and inferencing strategies of graph-based deep learning to leverage the sequence features and 1D positional information of genomic loci. With this framework, sparsity is no longer a challenge, but rather an advantage that the model leverages to achieve unprecedented computational efficiency. Compared to existing methods, scGSLoop makes more accurate predictions and is able to identify more loops that have the potential to play regulatory roles in genome functioning. Moreover, scGSLoop preserves single-cell information by identifying a distinct group of loops for each individual cell, which not only enables an understanding of the variability of chromatin looping states between cells, but also allows scGSLoop to be extended for the investigation of multi-connected hubs and their underlying mechanisms.


Subject(s)
Chromatin , Genomics , Chromatin/genetics , Genome
6.
Bioinformatics ; 39(7)2023 07 01.
Article in English | MEDLINE | ID: mdl-37399092

ABSTRACT

MOTIVATION: Chromothripsis, associated with poor clinical outcomes, is prognostically vital in multiple myeloma. The catastrophic event is reported to be detectable prior to the progression of multiple myeloma. As a result, chromothripsis detection can contribute to risk estimation and early treatment guidelines for multiple myeloma patients. However, manual diagnosis remains the gold standard approach to detect chromothripsis events with the whole-genome sequencing technology to retrieve both copy number variation (CNV) and structural variation data. Meanwhile, CNV data are much easier to obtain than structural variation data. Hence, in order to reduce the reliance on human experts' efforts and structural variation data extraction, it is necessary to establish a reliable and accurate chromothripsis detection method based on CNV data. RESULTS: To address those issues, we propose a method to detect chromothripsis solely based on CNV data. With the help of structure learning, the intrinsic relationship-directed acyclic graph of CNV features is inferred to derive a CNV embedding graph (i.e. CNV-DAG). Subsequently, a neural network based on Graph Transformer, local feature extraction, and non-linear feature interaction, is proposed with the embedding graph as the input to distinguish whether the chromothripsis event occurs. Ablation experiments, clustering, and feature importance analysis are also conducted to enable the proposed model to be explained by capturing mechanistic insights. AVAILABILITY AND IMPLEMENTATION: The source code and data are freely available at https://github.com/luvyfdawnYu/CNV_chromothripsis.


Subject(s)
Chromothripsis , Multiple Myeloma , Humans , Multiple Myeloma/diagnosis , Multiple Myeloma/genetics , DNA Copy Number Variations , Software , Neural Networks, Computer
7.
Adv Sci (Weinh) ; 10(11): e2204113, 2023 04.
Article in English | MEDLINE | ID: mdl-36762572

ABSTRACT

Single-cell RNA sequencing (scRNA-seq) quantifies the gene expression of individual cells, while bulk RNA sequencing (bulk RNA-seq) characterizes the mixed transcriptome of cells. The inference of drug sensitivities for individual cells can provide new insights to understand the mechanism of anti-cancer response heterogeneity and drug resistance at the cellular resolution. However, pharmacogenomic information related to their corresponding scRNA-seq is often limited. Therefore, a transfer learning model is proposed to infer the drug sensitivities at the single-cell level. This framework learns bulk transcriptome profiles and pharmacogenomics information from population cell lines in a large public dataset and transfers the knowledge to infer drug efficacy of individual cells. The results suggest that it is suitable to learn knowledge from pre-clinical cell lines to infer pre-existing cell subpopulations with different drug sensitivities prior to drug exposure. In addition, the model offers a new perspective on drug combinations. It is observed that drug-resistant subpopulations can be sensitive to other drugs (e.g., a subset of JHU006 is Vorinostat-resistant while Gefitinib-sensitive); this finding corroborates the previously reported drug combination (Gefitinib + Vorinostat) strategy in several cancer types. The identified drug sensitivity biomarkers reveal insights into the tumor heterogeneity and treatment at cellular resolution.


Subject(s)
Transcriptome , RNA-Seq/methods , Gefitinib , Vorinostat , Transcriptome/genetics , Sequence Analysis, RNA/methods
8.
iScience ; 25(12): 105535, 2022 Dec 22.
Article in English | MEDLINE | ID: mdl-36444296

ABSTRACT

Graph and image are two common representations of Hi-C cis-contact maps. Existing computational tools have only adopted Hi-C data modeled as unitary data structures but neglected the potential advantages of synergizing the information of different views. Here we propose GILoop, a dual-branch neural network that learns from both representations to identify genome-wide CTCF-mediated loops. With GILoop, we explore the combined strength of integrating the two view representations of Hi-C data and corroborate the complementary relationship between the views. In particular, the model outperforms the state-of-the-art loop calling framework and is also more robust against low-quality Hi-C libraries. We also uncover distinct preferences for matrix density by graph-based and image-based models, revealing interesting insights into Hi-C data elucidation. Finally, along with multiple transfer-learning case studies, we demonstrate that GILoop can accurately model the organizational and functional patterns of CTCF-mediated looping across different cell lines.

9.
Brief Bioinform ; 23(6)2022 11 19.
Article in English | MEDLINE | ID: mdl-36274236

ABSTRACT

MOTIVATION: The identification of drug-target interactions (DTIs) plays a vital role in in silico drug discovery, in which the drug is the chemical molecule, and the target is the protein residues in the binding pocket. Manual DTI annotation approaches remain reliable; however, it is notoriously laborious and time-consuming to test each drug-target pair exhaustively. Recently, the rapid growth of labelled DTI data has catalysed interest in high-throughput DTI prediction. Unfortunately, those methods rely heavily on manually defined features, leading to errors. RESULTS: Here, we developed an end-to-end deep learning framework called CoaDTI to significantly improve the efficiency and interpretability of drug target annotation. CoaDTI incorporates the Co-attention mechanism to model the interaction information from the drug modality and protein modality. In particular, CoaDTI incorporates a transformer to learn the protein representations from raw amino acid sequences, and GraphSage to extract the molecule graph features from SMILES. Furthermore, we proposed to employ the transfer learning strategy to encode protein features with a pre-trained transformer to address the issue of scarce labelled data. The experimental results demonstrate that CoaDTI achieves competitive performance on three public datasets compared with state-of-the-art models. In addition, the transfer learning strategy further boosts the performance to an unprecedented level. The extended study reveals that CoaDTI can identify novel DTIs such as reactions between candidate drugs and severe acute respiratory syndrome coronavirus 2-associated proteins. The visualization of co-attention scores can illustrate the interpretability of our model for mechanistic insights. AVAILABILITY: Source code is publicly available at https://github.com/Layne-Huang/CoaDTI.


Subject(s)
COVID-19 , Humans , Computer Simulation , Proteins/chemistry , Amino Acid Sequence , Drug Discovery/methods
10.
IEEE J Biomed Health Inform ; 26(8): 4303-4313, 2022 08.
Article in English | MEDLINE | ID: mdl-35439152

ABSTRACT

Exploring the prognostic classification and biomarkers in Head and Neck Squamous Carcinoma (HNSC) is of great clinical significance. We hybridized three prominent strategies to comprehensively characterize the molecular features of HNSC. We constructed a 15-gene signature to predict patients' death risk with an average AUC of 0.744 at 1, 3, and 5 years on the TCGA-HNSC training set, and average AUCs of 0.636, 0.584, and 0.755 in the GSE65858, GSE-112026, and CPTAC-HNSCC datasets, respectively. By combining this with NMF clustering and consensus clustering of the fraction of tumor immune cell infiltration (ICI) in the tumor microenvironment (TME), we captured more refined biological characteristics of HNSC and observed prognostic heterogeneity in patients with high tumor immunity. By matching tumor subset-specific expression signatures to drug-induced cell line expression profiles from large-scale pharmacogenomic databases in the OCTAD workspace, we identified a group of HNSC patients characterized by poor prognosis and demonstrated that the individuals in this group are likely to show increased drug sensitivity to reverse differentially expressed disease signature genes. This trend is especially pronounced among those with higher death risk and tumor immunity.


Subject(s)
Gene Expression Profiling , Head and Neck Neoplasms , Biomarkers, Tumor/genetics , Head and Neck Neoplasms/drug therapy , Head and Neck Neoplasms/genetics , Humans , Prognosis , Squamous Cell Carcinoma of Head and Neck/genetics , Transcriptome , Treatment Outcome , Tumor Microenvironment/genetics
11.
IEEE J Biomed Health Inform ; 26(8): 4335-4344, 2022 08.
Article in English | MEDLINE | ID: mdl-35471879

ABSTRACT

Targeted therapy for one or a set of genes has made it possible to apply precision medicine for different patients due to the existence of tumor heterogeneity. However, how to regulate those genes is still problematic. One of the natural regulators of genes is microRNAs. Thus, a better understanding of the miRNA-gene interaction mechanism might contribute to future diagnosis, prevention, and cancer therapy. The interactions between microRNAs and genes play an essential role in molecular genetics. The in-vivo experiments validating the relationships between them are time-consuming, costly, and labor-intensive. With the development of high-throughput technology, we now deal with large volumes of biological data. However, extracting features from tremendous raw data and building a mathematical model is still a challenging topic. Machine learning and deep learning algorithms have become powerful tools in dealing with biological data. Inspired by this, in this paper, we propose a model that combines features/embedding extraction methods, deep learning algorithms, and a voting system. We leverage doc2vec to generate sequential embeddings from molecular sequences. Geometrical embeddings were generated with role2vec, GCN, and GMM from the complex network built from similarity and pair-wise datasets. For the deep learning algorithms, we leveraged LSTM and Bi-LSTM according to different embeddings and features. Finally, we adopted a voting system to balance results from different data sources. The results have shown that our voting system could achieve a higher AUC than the existing benchmark. The case studies demonstrate that our model could reveal potential relationships between miRNAs and genes. The source code, features, and predictive results can be downloaded at https://github.com/Xshelton/SRG-vote.


Subject(s)
Algorithms , MicroRNAs , Computational Biology/methods , Humans , Machine Learning , MicroRNAs/genetics , Politics , Software
12.
Brief Bioinform ; 23(1)2022 01 17.
Article in English | MEDLINE | ID: mdl-34791012

ABSTRACT

MOTIVATION: The rapid growth in literature accumulates diverse yet comprehensive biomedical knowledge, such as drug interactions, that remains hidden and waiting to be mined. However, it is difficult to extract the heterogeneous knowledge to retrieve or even discover the latest and novel knowledge in an efficient manner. To address such a problem, we propose EGFI for extracting and consolidating drug interactions from large-scale medical literature text data. Specifically, EGFI consists of two parts: classification and generation. In the classification part, EGFI encompasses the language model BioBERT, which has been comprehensively pretrained on a biomedical corpus. In particular, we propose the multihead self-attention mechanism and packed BiGRU to fuse multiple semantic information for rigorous context modeling. In the generation part, EGFI utilizes another pretrained language model, BioGPT-2, where the generated sentences are selected based on filtering rules. RESULTS: We evaluated the classification part on the 'DDIs 2013' dataset and the 'DTIs' dataset, achieving F1 scores of 0.842 and 0.720, respectively. Moreover, we applied the classification part to distinguish high-quality generated sentences and verified them against the existing ground truth to confirm the filtered sentences. The generated sentences that are not recorded in DrugBank and the DDIs 2013 dataset demonstrated the potential of EGFI to identify novel drug relationships. AVAILABILITY: Source code is publicly available at https://github.com/Layne-Huang/EGFI.
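For readers unfamiliar with the F1 scores reported in this abstract, the sketch below computes F1 from true-positive, false-positive, and false-negative counts, plus a micro-averaged variant that pools counts across classes. The abstract does not say which averaging scheme was used, so micro-averaging and the function names are assumptions for illustration.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = harmonic mean of precision and recall, computed from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_f1(per_class_counts):
    """Pool (tp, fp, fn) over all classes, then compute a single F1.

    per_class_counts maps a class label to a (tp, fp, fn) tuple.
    Micro-averaging is an assumed choice, not taken from the abstract.
    """
    tp = sum(counts[0] for counts in per_class_counts.values())
    fp = sum(counts[1] for counts in per_class_counts.values())
    fn = sum(counts[2] for counts in per_class_counts.values())
    return f1_from_counts(tp, fp, fn)
```

With 8 true positives, 2 false positives, and 2 false negatives, precision and recall are both 0.8, so the F1 score is also 0.8.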


Subject(s)
Language , Natural Language Processing , Drug Interactions , Semantics , Software
13.
Commun Biol ; 4(1): 496, 2021 04 22.
Article in English | MEDLINE | ID: mdl-33888849

ABSTRACT

Neoantigen-based immunotherapy has yielded promising results in clinical trials. However, it is limited to tumor-specific mutations, and is often tailored to individual patients. Identifying suitable tumor-specific antigens is still a major challenge. Previous proteogenomics studies have identified peptides encoded by predicted non-coding sequences in the human genome. To investigate whether tumors express specific peptides encoded by non-coding genes, we analyzed published proteomics data from five cancer types including 933 tumor samples and 275 matched normal samples and compared these to data from 31 different healthy human tissues. Our results reveal that many predicted non-coding genes such as DGCR9 and RHOXF1P3 encode peptides that are overexpressed in tumors compared to normal controls. Furthermore, from the peptides encoded by non-coding genes and specifically detected in cancers, we predict a large number of "dark antigens" (neoantigens from non-coding genomic regions), which may provide an alternative source of neoantigens beyond standard tumor-specific mutations.


Subject(s)
Antigens, Neoplasm/immunology , Neoplasms/genetics , Peptides/genetics , Proteome/genetics , Antigens, Neoplasm/genetics , Humans , Peptides/metabolism , Proteogenomics
14.
Gut ; 69(5): 877-887, 2020 05.
Article in English | MEDLINE | ID: mdl-31462556

ABSTRACT

OBJECTIVE: Insulinomas and non-functional pancreatic neuroendocrine tumours (NF-PanNETs) have distinctive clinical presentations but share similar pathological features. Their genetic bases have not been comprehensively compared. Herein, we used whole-genome/whole-exome sequencing (WGS/WES) to identify genetic differences between insulinomas and NF-PanNETs. DESIGN: The mutational profiles and copy-number variation (CNV) patterns of 211 PanNETs, including 84 insulinomas and 127 NF-PanNETs, were obtained from WGS/WES data provided by Peking Union Medical College Hospital and the International Cancer Genome Consortium. Insulinoma RNA sequencing and immunohistochemistry data were assayed. RESULTS: PanNETs were categorised based on CNV patterns: amplification, copy neutral and deletion. Insulinomas showed the CNV-amplification and copy-neutral patterns and lacked CNV deletions. CNV-neutral insulinomas exhibited an elevated rate of YY1 mutations. In contrast, NF-PanNETs had all three CNV patterns, and NF-PanNETs with CNV deletions had a high rate of loss-of-function mutations of tumour suppressor genes. NF-PanNETs with CNV alterations (amplification and deletion) had an elevated risk of relapse, and additional DAXX/ATRX mutations could predict an increased relapse risk in the first 2-year period. CONCLUSION: These WGS/WES data allowed a comprehensive assessment of genetic differences between insulinomas and NF-PanNETs, reclassifying these tumours into novel molecular subtypes. We also proposed a novel relapse risk stratification system using CNV patterns and DAXX/ATRX mutations.


Subject(s)
Gene Dosage/genetics , Insulinoma/genetics , Neuroendocrine Tumors/genetics , Pancreatic Neoplasms/genetics , Whole Genome Sequencing/methods , Asymptomatic Diseases/classification , Biopsy, Needle , Diagnosis, Differential , Female , Humans , Immunohistochemistry , Insulinoma/classification , Male , Mutation , Neuroendocrine Tumors/classification , Nuclear Proteins/genetics , Pancreatic Neoplasms/classification , Risk Assessment , Exome Sequencing