Deep Denoising of Raw Biomedical Knowledge Graph From COVID-19 Literature, LitCovid, and Pubtator: Framework Development and Validation.

Jiang, Chao; Ngo, Victoria; Chapman, Richard; Yu, Yue; Liu, Hongfang; Jiang, Guoqian; Zong, Nansu

Jiang C; Department of Computer Science and Software Engineering, Auburn University, Auburn, AL, United States.
Ngo V; Center for Innovation to Implementation, VA Palo Alto Health Care System, Sacramento, CA, United States.
Chapman R; Stanford Health Policy, Stanford School of Medicine, Stanford University, Stanford, CA, United States.
Yu Y; Freeman Spogli Institute for International Studies, Stanford University, Stanford, CA, United States.
Liu H; Department of Computer Science and Software Engineering, Auburn University, Auburn, AL, United States.
Jiang G; Department of Artificial Intelligence and Informatics Research, Mayo Clinic, Rochester, MN, United States.
Zong N; Department of Artificial Intelligence and Informatics Research, Mayo Clinic, Rochester, MN, United States.

J Med Internet Res; 24(7): e38584, 2022 Jul 06.

Article in English | MEDLINE | ID: covidwho-1933490

ABSTRACT

BACKGROUND:

Knowledge graphs that capture multiple types of biomedical associations, including COVID-19-related ones, are constructed from co-occurring biomedical entities retrieved from recent literature. However, applications derived from these raw graphs (eg, association predictions among genes, drugs, and diseases) have a high probability of false-positive predictions because co-occurrence in the literature does not always indicate a true biomedical association between two entities.

OBJECTIVE:

Data quality plays an important role in training deep neural network models; however, most of the current work in this area has been focused on improving a model's performance with the assumption that the preprocessed data are clean. Here, we studied how to remove noise from raw knowledge graphs with limited labeled information.

METHODS:

The proposed framework used generative deep neural networks to generate a graph from which the unknown associations in the raw training graph can be distinguished. Two generative graph models, NetGAN and Cross-Entropy Low-rank Logits (CELL), were adopted for edge classification (ie, link prediction), leveraging unlabeled link information in a real knowledge graph built from LitCovid and Pubtator.
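The idea of scoring unlabeled links from a compact generative summary of the graph can be illustrated with a much simpler stand-in. The sketch below is not NetGAN or CELL: it swaps in a truncated eigendecomposition of the adjacency matrix as a basic low-rank scorer, and the function names, the toy two-community graph, and the chosen rank are all hypothetical.

```python
import numpy as np


def low_rank_edge_scores(adj: np.ndarray, rank: int) -> np.ndarray:
    """Score every node pair with a rank-`rank` reconstruction of a symmetric adjacency matrix."""
    # eigh returns eigenvalues in ascending order for symmetric input,
    # so the last `rank` columns belong to the largest eigenvalues.
    vals, vecs = np.linalg.eigh(adj.astype(float))
    top_vals, top_vecs = vals[-rank:], vecs[:, -rank:]
    scores = (top_vecs * top_vals) @ top_vecs.T
    np.fill_diagonal(scores, 0.0)  # self-loops are not candidate associations
    return scores


def two_community_graph(size: int) -> np.ndarray:
    """Hypothetical toy graph: two fully connected communities of `size` nodes each."""
    adj = np.zeros((2 * size, 2 * size))
    adj[:size, :size] = 1.0
    adj[size:, size:] = 1.0
    np.fill_diagonal(adj, 0.0)
    return adj


if __name__ == "__main__":
    adj = two_community_graph(20)
    adj[0, 1] = adj[1, 0] = 0.0      # hide a plausible within-community edge
    adj[19, 20] = adj[20, 19] = 1.0  # inject a spurious cross-community edge
    scores = low_rank_edge_scores(adj, rank=2)
    print(scores[0, 1], scores[19, 20])
```

On this toy graph, the hidden within-community pair should score higher than the injected cross-community edge, which is the kind of signal a denoising step could use to flag co-occurrence edges that lack structural support.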

RESULTS:

The performance of link prediction, especially in the extreme case of a training-to-test data ratio of 1:9, demonstrated that the proposed method still achieved favorable results (area under the receiver operating characteristic curve >0.8 for the synthetic data set and 0.7 for the real data set), despite the limited amount of testing data available.
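The evaluation described here can be sketched in a few lines. The example assumes the reported ratio means one part training data to nine parts test data, and it computes the area under the receiver operating characteristic curve as the probability that a randomly chosen true edge outscores a randomly chosen false one. The helper names, the seed, and the edge list are hypothetical and are not taken from the study.

```python
import random


def split_edges(labeled_edges, train_fraction=0.1, seed=0):
    """Shuffle labeled edges and split them; 0.1 gives a 1:9 train-to-test ratio."""
    edges = list(labeled_edges)
    random.Random(seed).shuffle(edges)
    cut = round(len(edges) * train_fraction)
    return edges[:cut], edges[cut:]


def auroc(scores, labels):
    """AUROC as the chance that a positive outranks a negative (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical labeled edges: (source, target, label).
    edges = [(i, i + 1, i % 2) for i in range(100)]
    train, test = split_edges(edges)
    print(len(train), len(test))
    print(auroc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```

The pairwise formulation is quadratic in the number of edges, which is fine for a small labeled set; a rank-based implementation would be the usual choice at scale.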

CONCLUSIONS:

Our preliminary findings showed that the proposed framework achieved promising results in removing noise during preprocessing of the biomedical knowledge graph, potentially improving the performance of downstream applications by providing cleaner data.

Subject(s)

COVID-19; Humans; Knowledge; Neural Networks, Computer; Pattern Recognition, Automated; ROC Curve

Keywords

COVID-19; adversarial generative network; biomedical; deep denoising; knowledge graph; machine learning; network model; neural network; training data

Full text: Available | Collection: International databases | Database: MEDLINE | Main subject: COVID-19 | Type of study: Prognostic study / Randomized controlled trials | Limits: Humans | Language: English | Journal: J Med Internet Res | Journal subject: Medical Informatics | Year: 2022 | Document Type: Article | Affiliation country: United States
