Search | VHL Regional Portal

1.

The probability of edge existence due to node degree: a baseline for network-based predictions.

Zietz, Michael; Himmelstein, Daniel S; Kloster, Kyle; Williams, Christopher; Nagle, Michael W; Greene, Casey S.

Gigascience ; 132024 Jan 02.

Article in English | MEDLINE | ID: mdl-38323677

ABSTRACT

Important tasks in biomedical discovery such as predicting gene functions, gene-disease associations, and drug repurposing opportunities are often framed as network edge prediction. The number of edges connecting to a node, termed degree, can vary greatly across nodes in real biomedical networks, and the distribution of degrees varies between networks. If degree strongly influences edge prediction, then imbalance or bias in the distribution of degrees could lead to nonspecific or misleading predictions. We introduce a network permutation framework to quantify the effects of node degree on edge prediction. Our framework decomposes performance into the proportions attributable to degree and the network's specific connections using network permutation to generate features that depend only on degree. We discover that performance attributable to factors other than degree is often only a small portion of overall performance. Researchers seeking to predict new or missing edges in biological networks should use our permutation approach to obtain a baseline for performance that may be nonspecific because of degree. We released our methods as an open-source Python package (https://github.com/hetio/xswap/).

Subject(s)

Algorithms , Probability

2.

Hetnet connectivity search provides rapid insights into how two biomedical entities are related.

Himmelstein, Daniel S; Zietz, Michael; Rubinetti, Vincent; Kloster, Kyle; Heil, Benjamin J; Alquaddoomi, Faisal; Hu, Dongbo; Nicholson, David N; Hao, Yun; Sullivan, Blair D; Nagle, Michael W; Greene, Casey S.

bioRxiv ; 2023 Jan 07.

Article in English | MEDLINE | ID: mdl-36711546

ABSTRACT

Hetnets, short for "heterogeneous networks", contain multiple node and relationship types and offer a way to encode biomedical knowledge. One such example, Hetionet connects 11 types of nodes - including genes, diseases, drugs, pathways, and anatomical structures - with over 2 million edges of 24 types. Previous work has demonstrated that supervised machine learning methods applied to such networks can identify drug repurposing opportunities. However, a training set of known relationships does not exist for many types of node pairs, even when it would be useful to examine how nodes of those types are meaningfully connected. For example, users may be curious not only how metformin is related to breast cancer, but also how the GJA1 gene might be involved in insomnia. We developed a new procedure, termed hetnet connectivity search, that proposes important paths between any two nodes without requiring a supervised gold standard. The algorithm behind connectivity search identifies types of paths that occur more frequently than would be expected by chance (based on node degree alone). We find that predictions are broadly similar to those from previously described supervised approaches for certain node type pairs. Scoring of individual paths is based on the most specific paths of a given type. Several optimizations were required to precompute significant instances of node connectivity at the scale of large knowledge graphs. We implemented the method on Hetionet and provide an online interface at https://het.io/search . We provide an open source implementation of these methods in our new Python package named hetmatpy .

3.

The probability of edge existence due to node degree: a baseline for network-based predictions.

Zietz, Michael; Himmelstein, Daniel S; Kloster, Kyle; Williams, Christopher; Nagle, Michael W; Greene, Casey S.

bioRxiv ; 2023 Jan 06.

Article in English | MEDLINE | ID: mdl-36711569

ABSTRACT

Important tasks in biomedical discovery such as predicting gene functions, gene-disease associations, and drug repurposing opportunities are often framed as network edge prediction. The number of edges connecting to a node, termed degree, can vary greatly across nodes in real biomedical networks, and the distribution of degrees varies between networks. If degree strongly influences edge prediction, then imbalance or bias in the distribution of degrees could lead to nonspecific or misleading predictions. We introduce a network permutation framework to quantify the effects of node degree on edge prediction. Our framework decomposes performance into the proportions attributable to degree and the network's specific connections. We discover that performance attributable to factors other than degree is often only a small portion of overall performance. Degree's predictive performance diminishes when the networks used for training and testing-despite measuring the same biological relationships-were generated using distinct techniques and hence have large differences in degree distribution. We introduce the permutation-derived edge prior as the probability that an edge exists based only on degree. The edge prior shows excellent discrimination and calibration for 20 biomedical networks (16 bipartite, 3 undirected, 1 directed), with AUROCs frequently exceeding 0.85. Researchers seeking to predict new or missing edges in biological networks should use the edge prior as a baseline to identify the fraction of performance that is nonspecific because of degree. We released our methods as an open-source Python package (https://github.com/hetio/xswap/).

4.

Hetnet connectivity search provides rapid insights into how biomedical entities are related.

Himmelstein, Daniel S; Zietz, Michael; Rubinetti, Vincent; Kloster, Kyle; Heil, Benjamin J; Alquaddoomi, Faisal; Hu, Dongbo; Nicholson, David N; Hao, Yun; Sullivan, Blair D; Nagle, Michael W; Greene, Casey S.

Gigascience ; 122022 12 28.

Article in English | MEDLINE | ID: mdl-37503959

ABSTRACT

BACKGROUND: Hetnets, short for "heterogeneous networks," contain multiple node and relationship types and offer a way to encode biomedical knowledge. One such example, Hetionet, connects 11 types of nodes-including genes, diseases, drugs, pathways, and anatomical structures-with over 2 million edges of 24 types. Previous work has demonstrated that supervised machine learning methods applied to such networks can identify drug repurposing opportunities. However, a training set of known relationships does not exist for many types of node pairs, even when it would be useful to examine how nodes of those types are meaningfully connected. For example, users may be curious about not only how metformin is related to breast cancer but also how a given gene might be involved in insomnia. FINDINGS: We developed a new procedure, termed hetnet connectivity search, that proposes important paths between any 2 nodes without requiring a supervised gold standard. The algorithm behind connectivity search identifies types of paths that occur more frequently than would be expected by chance (based on node degree alone). Several optimizations were required to precompute significant instances of node connectivity at the scale of large knowledge graphs. CONCLUSION: We implemented the method on Hetionet and provide an online interface at https://het.io/search. We provide an open-source implementation of these methods in our new Python package named hetmatpy.

Subject(s)

Algorithms , Probability

5.

AptRank: an adaptive PageRank model for protein function prediction on bi-relational graphs.

Jiang, Biaobin; Kloster, Kyle; Gleich, David F; Gribskov, Michael.

Bioinformatics ; 33(12): 1829-1836, 2017 Jun 15.

Article in English | MEDLINE | ID: mdl-28200073

ABSTRACT

MOTIVATION: Diffusion-based network models are widely used for protein function prediction using protein network data and have been shown to outperform neighborhood-based and module-based methods. Recent studies have shown that integrating the hierarchical structure of the Gene Ontology (GO) data dramatically improves prediction accuracy. However, previous methods usually either used the GO hierarchy to refine the prediction results of multiple classifiers, or flattened the hierarchy into a function-function similarity kernel. No study has taken the GO hierarchy into account together with the protein network as a two-layer network model. RESULTS: We first construct a Bi-relational graph (Birg) model comprised of both protein-protein association and function-function hierarchical networks. We then propose two diffusion-based methods, BirgRank and AptRank, both of which use PageRank to diffuse information on this two-layer graph model. BirgRank is a direct application of traditional PageRank with fixed decay parameters. In contrast, AptRank utilizes an adaptive diffusion mechanism to improve the performance of BirgRank. We evaluate the ability of both methods to predict protein function on yeast, fly and human protein datasets, and compare with four previous methods: GeneMANIA, TMC, ProteinRank and clusDCA. We design four different validation strategies: missing function prediction, de novo function prediction, guided function prediction and newly discovered function prediction to comprehensively evaluate predictability of all six methods. We find that both BirgRank and AptRank outperform the previous methods, especially in missing function prediction when using only 10% of the data for training. AVAILABILITY AND IMPLEMENTATION: The MATLAB code is available at https://github.rcac.purdue.edu/mgribsko/aptrank . CONTACT: gribskov@purdue.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Computational Biology/methods , Gene Ontology , Proteins/metabolism , Software , Algorithms , Animals , Drosophila/metabolism , Humans , Proteins/physiology , Saccharomyces cerevisiae/metabolism

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL