Search | VHL Regional Portal

High-quality gene/disease embedding in a multi-relational heterogeneous graph after a joint matrix/tensor decomposition.

Zhou, Kaiyin; Zhang, Sheng; Wang, Yuxing; Cohen, Kevin Bretonnel; Kim, Jin-Dong; Luo, Qi; Yao, Xinzhi; Zhou, Xingyu; Xia, Jingbo.

J Biomed Inform ; 126: 103973, 2022 02.

Article in English | MEDLINE | ID: mdl-34995810

ABSTRACT

MOTIVATION: Node embedding of biological entity networks has been widely investigated for downstream applications. To embed the full semantics of genes and diseases, a multi-relational heterogeneous graph is considered in a scenario where uni-relational links between genes/diseases and other heterogeneous entities are abundant while multi-relational links between genes and diseases are relatively sparse. Given this novel graph format, a specific data integration algorithm is needed to fully capture the graph information and produce high-quality embeddings. RESULTS: First, a typical multi-relational triple dataset was introduced, which carried significant associations between genes and diseases. Second, we curated all human genes and diseases in seven mainstream datasets and constructed a large-scale gene-disease network comprising 163,024 nodes and 25,265,607 edges, covering 27,165 genes, 2,665 diseases, 15,067 chemicals, 108,023 mutations, 2,363 pathways, and 7,732 phenotypes. Third, we proposed a Joint Decomposition of Heterogeneous Matrix and Tensor (JDHMT) model, which integrated all heterogeneous data resources and obtained an embedding for each gene or disease. Fourth, a visualized intrinsic evaluation was performed, which investigated the embeddings in terms of interpretable data clustering. Furthermore, an extrinsic evaluation was performed in the form of link prediction. Both intrinsic and extrinsic evaluation results showed that the JDHMT model outperformed eleven other state-of-the-art (SOTA) methods under the relation-learning, proximity-preserving or message-passing paradigms. Finally, the constructed gene-disease network, embedding results and code were made available. DATA AND CODES AVAILABILITY: The constructed massive gene-disease network is available at: https://hzaubionlp.com/heterogeneous-biological-network/. The codes are available at: https://github.com/bionlp-hzau/JDHMT.

Subject(s)

Algorithms , Semantics , Learning , Phenotype
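
The abstract above distinguishes multi-relational gene-disease triples from uni-relational links between genes and other entity types. As a rough illustration of why one would naturally be stored as a third-order tensor and the other as a matrix sharing the gene axis, here is a minimal sketch. All identifiers and data are hypothetical toy values, and this is not the JDHMT implementation or the layout of the published dataset.

```python
# Hypothetical toy data; identifiers are illustrative, not from the paper's dataset.
# Multi-relational gene-disease triples: (gene, relation, disease).
triples = [
    ("GENE_A", "gain_of_function", "DISEASE_X"),
    ("GENE_A", "loss_of_function", "DISEASE_Y"),
    ("GENE_B", "loss_of_function", "DISEASE_X"),
]
# Uni-relational links between genes and another entity type (here, pathways).
gene_pathway = [("GENE_A", "PATHWAY_1"), ("GENE_B", "PATHWAY_1")]


def index(items):
    """Map each distinct item to an integer id, assigned in sorted order."""
    return {item: i for i, item in enumerate(sorted(set(items)))}


genes = index([g for g, _, _ in triples] + [g for g, _ in gene_pathway])
relations = index(r for _, r, _ in triples)
diseases = index(d for _, _, d in triples)
pathways = index(p for _, p in gene_pathway)

# Sparse third-order tensor: keys are (gene, relation, disease) index triples.
tensor = {(genes[g], relations[r], diseases[d]): 1.0 for g, r, d in triples}

# Sparse matrix: keys are (gene, pathway) index pairs. It reuses the same
# gene indices as the tensor, which is what lets a joint model tie them together.
matrix = {(genes[g], pathways[p]): 1.0 for g, p in gene_pathway}
```

Because both structures index genes identically, a joint decomposition can, in principle, learn one embedding per gene that has to explain both the tensor and the matrix entries.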

LitCovid-AGAC: cellular and molecular level annotation data set based on COVID-19.

Ouyang, Sizhuo; Wang, Yuxing; Zhou, Kaiyin; Xia, Jingbo.

Genomics Inform ; 19(3): e23, 2021 Sep.

Article in English | MEDLINE | ID: mdl-34638170

ABSTRACT

The coronavirus disease 2019 (COVID-19) literature has been growing dramatically, and the increased amount of text makes it possible to perform large-scale text mining and knowledge discovery. Curation of these texts has therefore become a crucial issue for the Bio-medical Natural Language Processing (BioNLP) community, so as to retrieve important information about the mechanism of COVID-19. PubAnnotation is an aligned annotation system which provides an efficient platform for biological curators to upload their annotations or merge other external annotations. Inspired by the integration among multiple useful COVID-19 annotations, we merged three annotation resources into the LitCovid data set and constructed a cross-annotated corpus, LitCovid-AGAC. This corpus consists of 12 labels, including Mutation, Species, Gene and Disease from PubTator; GO and CHEBI from OGER; and Var, MPA, CPA, NegReg, PosReg and Reg from AGAC, over 50,018 COVID-19 abstracts in LitCovid. The corpus contains abundant information that may help unveil hidden knowledge about the pathological mechanism of COVID-19.
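
To make the idea of a cross-annotated corpus concrete, the following sketch merges span annotations from two sources over the same text. The sentence, the offsets and the `merge_annotations` helper are all made up for illustration; this is not PubAnnotation's actual data format or API, only the label names are taken from the abstract.

```python
def merge_annotations(*sources):
    """Combine span annotations from several sources into one sorted list.

    Each source is a (name, spans) pair; each span is (start, end, label),
    with character offsets into the same underlying text.
    """
    merged = []
    for name, spans in sources:
        for start, end, label in spans:
            merged.append(
                {"start": start, "end": end, "label": label, "source": name}
            )
    # Sort by position so annotations from different sources interleave.
    return sorted(merged, key=lambda a: (a["start"], a["end"], a["source"]))


# Invented placeholder sentence; it makes no biological claim.
text = "GENE1 mutation reduces binding."
pubtator = ("PubTator", [(0, 5, "Gene")])
agac = ("AGAC", [(6, 14, "Var"), (15, 22, "NegReg")])

merged = merge_annotations(pubtator, agac)
for ann in merged:
    print(ann["source"], ann["label"], repr(text[ann["start"]:ann["end"]]))
```

Keeping the source name on every annotation preserves provenance, so overlapping labels from different tools can still be told apart after merging.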

Bridging heterogeneous mutation data to enhance disease gene discovery.

Zhou, Kaiyin; Wang, Yuxing; Bretonnel Cohen, Kevin; Kim, Jin-Dong; Ma, Xiaohang; Shen, Zhixue; Meng, Xiangyu; Xia, Jingbo.

Brief Bioinform ; 22(5), 2021 09 02.

Article in English | MEDLINE | ID: mdl-33847357

ABSTRACT

Bridging heterogeneous mutation data fills the gap between various data categories and propels the discovery of disease-related genes. Genome-wide association studies (GWAS) infer significant mutation associations that link genotype and phenotype. However, due to differences in size and quality between GWAS studies, not all genuinely important variations are able to pass multiple-testing correction. Meanwhile, mutation events widely reported in the literature reveal typical functional biological processes, including mutation types such as gain of function and loss of function. To bring together the heterogeneous mutation data, we propose a 'Gene-Disease Association prediction by Mutation Data Bridging (GDAMDB)' pipeline with a statistical generative model. The model learns the distribution parameters of mutation associations and mutation types, and recovers false-negative GWAS mutations that fail to pass the significance test but are supported by evidence of functional biological processes in the literature. We applied GDAMDB to Alzheimer's disease (AD) and predicted 79 AD-associated genes. Of these, 12 come from the original GWAS, 60 are supported as AD-related by other GWAS or literature reports, and the rest are newly predicted genes. Our model is capable of enhancing GWAS-based gene association discovery by effectively incorporating text mining results. The positive result indicates that bridging heterogeneous mutation data contributes to the discovery of novel disease-related genes.

Subject(s)

Alzheimer Disease/genetics , Genetic Association Studies/methods , Genetic Predisposition to Disease/genetics , Genome-Wide Association Study/methods , Mutation , Polymorphism, Single Nucleotide , Algorithms , Computational Biology/methods , Data Mining/methods , Gene Regulatory Networks/genetics , Genotype , Humans , Phenotype , Protein Interaction Maps/genetics , Reproducibility of Results
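
The abstract above notes that some important variants fail multiple-testing correction. The sketch below shows how that can happen under a Bonferroni correction. Bonferroni is an assumption chosen for simplicity (the abstract does not name the correction used), and the variant names and p-values are invented.

```python
def bonferroni_pass(p_values, alpha=0.05):
    """Return, per variant, whether its p-value clears the Bonferroni threshold.

    The threshold is alpha divided by the number of tests performed.
    """
    threshold = alpha / len(p_values)
    return {variant: p <= threshold for variant, p in p_values.items()}


# Hypothetical p-values for four tested variants.
p_values = {
    "variant_1": 1e-9,
    "variant_2": 0.004,
    "variant_3": 0.03,  # below 0.05 on its own, but above 0.05 / 4
    "variant_4": 0.5,
}

passed = bonferroni_pass(p_values)
print(passed)
```

Here `variant_3` would look significant in isolation but is rejected once the threshold is divided by the number of tests. A real GWAS tests far more variants, so the threshold is much stricter. Variants rejected in this way are the kind of candidate that a method combining GWAS with literature evidence aims to recover.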
