Search | VHL Regional Portal

1.

The probability of edge existence due to node degree: a baseline for network-based predictions.

Zietz, Michael; Himmelstein, Daniel S; Kloster, Kyle; Williams, Christopher; Nagle, Michael W; Greene, Casey S.

Gigascience ; 13, 2024 Jan 02.

Article in English | MEDLINE | ID: mdl-38323677

ABSTRACT

Important tasks in biomedical discovery such as predicting gene functions, gene-disease associations, and drug repurposing opportunities are often framed as network edge prediction. The number of edges connecting to a node, termed degree, can vary greatly across nodes in real biomedical networks, and the distribution of degrees varies between networks. If degree strongly influences edge prediction, then imbalance or bias in the distribution of degrees could lead to nonspecific or misleading predictions. We introduce a network permutation framework to quantify the effects of node degree on edge prediction. Our framework decomposes performance into the proportions attributable to degree and the network's specific connections using network permutation to generate features that depend only on degree. We discover that performance attributable to factors other than degree is often only a small portion of overall performance. Researchers seeking to predict new or missing edges in biological networks should use our permutation approach to obtain a baseline for performance that may be nonspecific because of degree. We released our methods as an open-source Python package (https://github.com/hetio/xswap/).

Subject(s)

Algorithms , Probability

2.

Hetnet connectivity search provides rapid insights into how two biomedical entities are related.

Himmelstein, Daniel S; Zietz, Michael; Rubinetti, Vincent; Kloster, Kyle; Heil, Benjamin J; Alquaddoomi, Faisal; Hu, Dongbo; Nicholson, David N; Hao, Yun; Sullivan, Blair D; Nagle, Michael W; Greene, Casey S.

bioRxiv ; 2023 Jan 07.

Article in English | MEDLINE | ID: mdl-36711546

ABSTRACT

Hetnets, short for "heterogeneous networks", contain multiple node and relationship types and offer a way to encode biomedical knowledge. One such example, Hetionet, connects 11 types of nodes (including genes, diseases, drugs, pathways, and anatomical structures) with over 2 million edges of 24 types. Previous work has demonstrated that supervised machine learning methods applied to such networks can identify drug repurposing opportunities. However, a training set of known relationships does not exist for many types of node pairs, even when it would be useful to examine how nodes of those types are meaningfully connected. For example, users may be curious not only about how metformin is related to breast cancer, but also about how the GJA1 gene might be involved in insomnia. We developed a new procedure, termed hetnet connectivity search, that proposes important paths between any two nodes without requiring a supervised gold standard. The algorithm behind connectivity search identifies types of paths that occur more frequently than would be expected by chance (based on node degree alone). We find that predictions are broadly similar to those from previously described supervised approaches for certain node type pairs. Scoring of individual paths is based on the most specific paths of a given type. Several optimizations were required to precompute significant instances of node connectivity at the scale of large knowledge graphs. We implemented the method on Hetionet and provide an online interface at https://het.io/search. We provide an open-source implementation of these methods in our new Python package named hetmatpy.

3.

The probability of edge existence due to node degree: a baseline for network-based predictions.

Zietz, Michael; Himmelstein, Daniel S; Kloster, Kyle; Williams, Christopher; Nagle, Michael W; Greene, Casey S.

bioRxiv ; 2023 Jan 06.

Article in English | MEDLINE | ID: mdl-36711569

ABSTRACT

Important tasks in biomedical discovery such as predicting gene functions, gene-disease associations, and drug repurposing opportunities are often framed as network edge prediction. The number of edges connecting to a node, termed degree, can vary greatly across nodes in real biomedical networks, and the distribution of degrees varies between networks. If degree strongly influences edge prediction, then imbalance or bias in the distribution of degrees could lead to nonspecific or misleading predictions. We introduce a network permutation framework to quantify the effects of node degree on edge prediction. Our framework decomposes performance into the proportions attributable to degree and the network's specific connections. We discover that performance attributable to factors other than degree is often only a small portion of overall performance. Degree's predictive performance diminishes when the networks used for training and testing, despite measuring the same biological relationships, were generated using distinct techniques and hence have large differences in degree distribution. We introduce the permutation-derived edge prior as the probability that an edge exists based only on degree. The edge prior shows excellent discrimination and calibration for 20 biomedical networks (16 bipartite, 3 undirected, 1 directed), with AUROCs frequently exceeding 0.85. Researchers seeking to predict new or missing edges in biological networks should use the edge prior as a baseline to identify the fraction of performance that is nonspecific because of degree. We released our methods as an open-source Python package (https://github.com/hetio/xswap/).
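
The permutation-derived edge prior described in this abstract can be illustrated with a minimal sketch. It is not the xswap package's API or the authors' implementation: the function names `permute_edges` and `edge_prior`, the toy network, and the parameter defaults are all assumptions made for illustration, and the sketch covers only undirected networks without self-loops or duplicate edges. It repeatedly applies degree-preserving edge swaps and estimates, for each node pair, the fraction of permuted networks in which that pair is connected.

```python
import random
from collections import Counter


def permute_edges(edges, n_swaps, rng):
    """Return a degree-preserving permutation of an undirected simple graph.

    Hypothetical helper: attempts `n_swaps` edge swaps, turning (a, b), (c, d)
    into (a, d), (c, b), and rejects swaps that would create a self-loop or a
    duplicate edge.
    """
    edges = [tuple(sorted(e)) for e in edges]
    edge_set = set(edges)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        # Flip the orientation of the second edge half the time so that both
        # possible rewirings of the pair can be proposed.
        if rng.random() < 0.5:
            c, d = d, c
        new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        if a == d or c == b or new1 in edge_set or new2 in edge_set:
            continue  # rejected: self-loop or duplicate edge
        edge_set -= {edges[i], edges[j]}
        edge_set |= {new1, new2}
        edges[i], edges[j] = new1, new2
    return edges


def edge_prior(edges, n_permutations=200, swaps_per_edge=10, seed=0):
    """Estimate, per node pair, the fraction of permuted networks with that edge."""
    rng = random.Random(seed)
    counts = Counter()
    current = list(edges)
    for _ in range(n_permutations):
        current = permute_edges(current, swaps_per_edge * len(current), rng)
        counts.update(current)
    return {pair: n / n_permutations for pair, n in counts.items()}


# Toy network (made up for illustration): node "a" has the highest degree.
toy_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("d", "e"), ("e", "f")]
prior = edge_prior(toy_edges)
for pair, probability in sorted(prior.items(), key=lambda kv: -kv[1]):
    print(pair, round(probability, 2))
```

Because every swap leaves each node's degree unchanged, the estimate for a pair depends only on the degree sequence, which is the sense in which such a prior can serve as a degree-only baseline. Pairs that never appear in any permutation are simply absent from the returned dictionary and would be treated as having an estimated prior of zero.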

4.

Unifying the identification of biomedical entities with the Bioregistry.

Hoyt, Charles Tapley; Balk, Meghan; Callahan, Tiffany J; Domingo-Fernández, Daniel; Haendel, Melissa A; Hegde, Harshad B; Himmelstein, Daniel S; Karis, Klas; Kunze, John; Lubiana, Tiago; Matentzoglu, Nicolas; McMurry, Julie; Moxon, Sierra; Mungall, Christopher J; Rutz, Adriano; Unni, Deepak R; Willighagen, Egon; Winston, Donald; Gyori, Benjamin M.

Sci Data ; 9(1): 714, 2022 11 19.

Article in English | MEDLINE | ID: mdl-36402838

ABSTRACT

The standardized identification of biomedical entities is a cornerstone of interoperability, reuse, and data integration in the life sciences. Several registries have been developed to catalog resources maintaining identifiers for biomedical entities such as small molecules, proteins, cell lines, and clinical trials. However, existing registries have struggled to provide sufficient coverage and metadata standards that meet the evolving needs of modern life sciences researchers. Here, we introduce the Bioregistry, an integrative, open, community-driven metaregistry that synthesizes and substantially expands upon 23 existing registries. The Bioregistry addresses the need for a sustainable registry by leveraging public infrastructure and automation, and employing a progressive governance model centered around open code and open data to foster community contribution. The Bioregistry can be used to support the standardized annotation of data, models, ontologies, and scientific literature, thereby promoting their interoperability and reuse. The Bioregistry can be accessed through https://bioregistry.io and its source code and data are available under the MIT and CC0 Licenses at https://github.com/biopragmatics/bioregistry.

5.

Expanding a database-derived biomedical knowledge graph via multi-relation extraction from biomedical abstracts.

Nicholson, David N; Himmelstein, Daniel S; Greene, Casey S.

BioData Min ; 15(1): 26, 2022 Oct 18.

Article in English | MEDLINE | ID: mdl-36258252

ABSTRACT

BACKGROUND: Knowledge graphs support biomedical research efforts by providing contextual information for biomedical entities, constructing networks, and supporting the interpretation of high-throughput analyses. These databases are populated via manual curation, which is challenging to scale with an exponentially rising publication rate. Data programming is a paradigm that circumvents this arduous manual process by combining databases with simple rules and heuristics written as label functions, which are programs designed to annotate textual data automatically. Unfortunately, writing a useful label function requires substantial error analysis and is a nontrivial task that takes multiple days per function. This bottleneck makes populating a knowledge graph with multiple nodes and edge types practically infeasible. Thus, we sought to accelerate the label function creation process by evaluating how label functions can be re-used across multiple edge types. RESULTS: We obtained entity-tagged abstracts and subsetted these entities to only contain compounds, genes, and disease mentions. We extracted sentences containing co-mentions of certain biomedical entities contained in a previously described knowledge graph, Hetionet v1. We trained a baseline model that used database-only label functions and then used a sampling approach to measure how well adding edge-specific or edge-mismatch label function combinations improved over our baseline. Next, we trained a discriminator model to detect sentences that indicated a biomedical relationship and then estimated the number of edge types that could be recalled and added to Hetionet v1. We found that adding edge-mismatch label functions rarely improved relationship extraction, while control edge-specific label functions did. There were two exceptions to this trend, Compound-binds-Gene and Gene-interacts-Gene, which both indicated physical relationships and showed signs of transferability. 
Across the scenarios tested, discriminative model performance strongly depends on generated annotations. Using the best discriminative model for each edge type, we recalled close to 30% of established edges within Hetionet v1. CONCLUSIONS: Our results show that this framework can incorporate novel edges into our source knowledge graph. However, results with label function transfer were mixed. Only label functions describing very similar edge types supported improved performance when transferred. We expect that the continued development of this strategy may provide essential building blocks to populating biomedical knowledge graphs with discoveries, ensuring that these resources include cutting-edge results.

6.

Hetnet connectivity search provides rapid insights into how biomedical entities are related.

Himmelstein, Daniel S; Zietz, Michael; Rubinetti, Vincent; Kloster, Kyle; Heil, Benjamin J; Alquaddoomi, Faisal; Hu, Dongbo; Nicholson, David N; Hao, Yun; Sullivan, Blair D; Nagle, Michael W; Greene, Casey S.

Gigascience ; 12, 2022 12 28.

Article in English | MEDLINE | ID: mdl-37503959

ABSTRACT

BACKGROUND: Hetnets, short for "heterogeneous networks," contain multiple node and relationship types and offer a way to encode biomedical knowledge. One such example, Hetionet, connects 11 types of nodes (including genes, diseases, drugs, pathways, and anatomical structures) with over 2 million edges of 24 types. Previous work has demonstrated that supervised machine learning methods applied to such networks can identify drug repurposing opportunities. However, a training set of known relationships does not exist for many types of node pairs, even when it would be useful to examine how nodes of those types are meaningfully connected. For example, users may be curious about not only how metformin is related to breast cancer but also how a given gene might be involved in insomnia. FINDINGS: We developed a new procedure, termed hetnet connectivity search, that proposes important paths between any 2 nodes without requiring a supervised gold standard. The algorithm behind connectivity search identifies types of paths that occur more frequently than would be expected by chance (based on node degree alone). Several optimizations were required to precompute significant instances of node connectivity at the scale of large knowledge graphs. CONCLUSION: We implemented the method on Hetionet and provide an online interface at https://het.io/search. We provide an open-source implementation of these methods in our new Python package named hetmatpy.

Subject(s)

Algorithms , Probability

7.

PMLB v1.0: an open-source dataset collection for benchmarking machine learning methods.

Romano, Joseph D; Le, Trang T; La Cava, William; Gregg, John T; Goldberg, Daniel J; Chakraborty, Praneel; Ray, Natasha L; Himmelstein, Daniel; Fu, Weixuan; Moore, Jason H.

Bioinformatics ; 38(3): 878-880, 2022 01 12.

Article in English | MEDLINE | ID: mdl-34677586

ABSTRACT

MOTIVATION: Novel machine learning and statistical modeling studies rely on standardized comparisons to existing methods using well-studied benchmark datasets. Few tools exist that provide rapid access to many of these datasets through a standardized, user-friendly interface that integrates well with popular data science workflows. RESULTS: This release of PMLB (Penn Machine Learning Benchmarks) provides the largest collection of diverse, public benchmark datasets for evaluating new machine learning and data science methods aggregated in one location. v1.0 introduces a number of critical improvements developed following discussions with the open-source community. AVAILABILITY AND IMPLEMENTATION: PMLB is available at https://github.com/EpistasisLab/pmlb. Python and R interfaces for PMLB can be installed through the Python Package Index and Comprehensive R Archive Network, respectively.

Subject(s)

Benchmarking , Software , Machine Learning , Models, Statistical

8.

An Open-Publishing Response to the COVID-19 Infodemic.

Rando, Halie M; Boca, Simina M; McGowan, Lucy D'Agostino; Himmelstein, Daniel S; Robson, Michael P; Rubinetti, Vincent; Velazquez, Ryan; Greene, Casey S; Gitter, Anthony.

ArXiv ; 2021 Sep 17.

Article in English | MEDLINE | ID: mdl-34545336

ABSTRACT

The COVID-19 pandemic catalyzed the rapid dissemination of papers and preprints investigating the disease and its associated virus, SARS-CoV-2. The multifaceted nature of COVID-19 demands a multidisciplinary approach, but the urgency of the crisis combined with the need for social distancing measures present unique challenges to collaborative science. We applied a massive online open publishing approach to this problem using Manubot. Through GitHub, collaborators summarized and critiqued COVID-19 literature, creating a review manuscript. Manubot automatically compiled citation information for referenced preprints, journal publications, websites, and clinical trials. Continuous integration workflows retrieved up-to-date data from online sources nightly, regenerating some of the manuscript's figures and statistics. Manubot rendered the manuscript into PDF, HTML, LaTeX, and DOCX outputs, immediately updating the version available online upon the integration of new content. Through this effort, we organized over 50 scientists from a range of backgrounds who evaluated over 1,500 sources and developed seven literature reviews. While many efforts from the computational community have focused on mining COVID-19 literature, our project illustrates the power of open publishing to organize both technical and non-technical scientists to aggregate and disseminate information in response to an evolving crisis.

9.

Analysis of scientific society honors reveals disparities.

Le, Trang T; Himmelstein, Daniel S; Hippen, Ariel A; Gazzara, Matthew R; Greene, Casey S.

Cell Syst ; 12(9): 900-906.e5, 2021 09 22.

Article in English | MEDLINE | ID: mdl-34555325

ABSTRACT

Delivering a keynote talk at a conference organized by a scientific society or being named as a fellow by such a society indicates that a scientist is held in high regard by their colleagues. To explore if the distribution of such indicators of esteem in the field of bioinformatics reflects the composition of this field, we compared the gender, name origin, and country of affiliation of 412 honorees from the "International Society for Computational Biology" (75 fellows and 337 keynote speakers) with over 170,000 last authorships on computational biology papers between 1993 and 2019. The proportion of honors bestowed on women was similar to that of the field's overall last authorship rate. However, names of East Asian origin have been persistently underrepresented among honorees. Moreover, there were roughly twice as many honors bestowed on scientists with an affiliation in the United States as expected based on literature authorship. A record of this paper's transparent peer review process is included in the supplemental information.

Subject(s)

Computational Biology , Societies, Scientific , Female , Humans , United States

10.

An Open-Publishing Response to the COVID-19 Infodemic.

Rando, Halie M; Boca, Simina M; McGowan, Lucy D'Agostino; Himmelstein, Daniel S; Robson, Michael P; Rubinetti, Vincent; Velazquez, Ryan; Greene, Casey S; Gitter, Anthony.

CEUR Workshop Proc ; 2976: 29-38, 2021 Sep.

Article in English | MEDLINE | ID: mdl-35558551

ABSTRACT

The COVID-19 pandemic catalyzed the rapid dissemination of papers and preprints investigating the disease and its associated virus, SARS-CoV-2. The multifaceted nature of COVID-19 demands a multidisciplinary approach, but the urgency of the crisis combined with the need for social distancing measures present unique challenges to collaborative science. We applied a massive online open publishing approach to this problem using Manubot. Through GitHub, collaborators summarized and critiqued COVID-19 literature, creating a review manuscript. Manubot automatically compiled citation information for referenced preprints, journal publications, websites, and clinical trials. Continuous integration workflows retrieved up-to-date data from online sources nightly, regenerating some of the manuscript's figures and statistics. Manubot rendered the manuscript into PDF, HTML, LaTeX, and DOCX outputs, immediately updating the version available online upon the integration of new content. Through this effort, we organized over 50 scientists from a range of backgrounds who evaluated over 1,500 sources and developed seven literature reviews. While many efforts from the computational community have focused on mining COVID-19 literature, our project illustrates the power of open publishing to organize both technical and non-technical scientists to aggregate and disseminate information in response to an evolving crisis.

11.

Is authorship sufficient for today's collaborative research? A call for contributor roles.

Vasilevsky, Nicole A; Hosseini, Mohammad; Teplitzky, Samantha; Ilik, Violeta; Mohammadi, Ehsan; Schneider, Juliane; Kern, Barbara; Colomb, Julien; Edmunds, Scott C; Gutzman, Karen; Himmelstein, Daniel S; White, Marijane; Smith, Britton; O'Keefe, Lisa; Haendel, Melissa; Holmes, Kristi L.

Account Res ; 28(1): 23-43, 2021 01.

Article in English | MEDLINE | ID: mdl-32602379

ABSTRACT

Assigning authorship and recognizing contributions to scholarly works is challenging on many levels. Here we discuss ethical, social, and technical challenges to the concept of authorship that may impede the recognition of contributions to a scholarly work. Recent work in the field of authorship shows that shifting to a more inclusive contributorship approach may address these challenges. Recent efforts to enable better recognition of contributions to scholarship include the development of the Contributor Role Ontology (CRO), which extends the CRediT taxonomy and can be used in information systems for structuring contributions. We also introduce the Contributor Attribution Model (CAM), which provides a simple data model that relates the contributor to research objects via the role that they played, as well as the provenance of the information. Finally, requirements for the adoption of a contributorship-based approach are discussed.

Subject(s)

Authorship , Humans

12.

Compressing gene expression data using multiple latent space dimensionalities learns complementary biological representations.

Way, Gregory P; Zietz, Michael; Rubinetti, Vincent; Himmelstein, Daniel S; Greene, Casey S.

Genome Biol ; 21(1): 109, 2020 05 11.

Article in English | MEDLINE | ID: mdl-32393369

ABSTRACT

BACKGROUND: Unsupervised compression algorithms applied to gene expression data extract latent or hidden signals representing technical and biological sources of variation. However, these algorithms require a user to select a biologically appropriate latent space dimensionality. In practice, most researchers fit a single algorithm and latent dimensionality. We sought to determine the extent by which selecting only one fit limits the biological features captured in the latent representations and, consequently, limits what can be discovered with subsequent analyses. RESULTS: We compress gene expression data from three large datasets consisting of adult normal tissue, adult cancer tissue, and pediatric cancer tissue. We train many different models across a large range of latent space dimensionalities and observe various performance differences. We identify more curated pathway gene sets significantly associated with individual dimensions in denoising autoencoder and variational autoencoder models trained using an intermediate number of latent dimensionalities. Combining compressed features across algorithms and dimensionalities captures the most pathway-associated representations. When trained with different latent dimensionalities, models learn strongly associated and generalizable biological representations including sex, neuroblastoma MYCN amplification, and cell types. Stronger signals, such as tumor type, are best captured in models trained at lower dimensionalities, while more subtle signals such as pathway activity are best identified in models trained with more latent dimensionalities. CONCLUSIONS: There is no single best latent dimensionality or compression algorithm for analyzing gene expression data. Instead, using features derived from different compression models across multiple latent space dimensionalities enhances biological representations.

Subject(s)

Data Compression/methods , Gene Expression , Models, Biological , Adult , Child , Humans , Neoplasms/metabolism , Supervised Machine Learning

13.

Open collaborative writing with Manubot.

Himmelstein, Daniel S; Rubinetti, Vincent; Slochower, David R; Hu, Dongbo; Malladi, Venkat S; Greene, Casey S; Gitter, Anthony.

PLoS Comput Biol ; 15(6): e1007128, 2019 06.

Article in English | MEDLINE | ID: mdl-31233491

ABSTRACT

Open, collaborative research is a powerful paradigm that can immensely strengthen the scientific process by integrating broad and diverse expertise. However, traditional research and multi-author writing processes break down at scale. We present new software named Manubot, available at https://manubot.org, to address the challenges of open scholarly writing. Manubot adopts the contribution workflow used by many large-scale open source software projects to enable collaborative authoring of scholarly manuscripts. With Manubot, manuscripts are written in Markdown and stored in a Git repository to precisely track changes over time. By hosting manuscript repositories publicly, such as on GitHub, multiple authors can simultaneously propose and review changes. A cloud service automatically evaluates proposed changes to catch errors. Publication with Manubot is continuous: When a manuscript's source changes, the rendered outputs are rebuilt and republished to a web page. Manubot automates bibliographic tasks by implementing citation by identifier, where users cite persistent identifiers (e.g. DOIs, PubMed IDs, ISBNs, URLs), whose metadata is then retrieved and converted to a user-specified style. Manubot modernizes publishing to align with the ideals of open science by making it transparent, reproducible, immediate, versioned, collaborative, and free of charge.
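
The idea of citation by identifier mentioned in this abstract can be sketched as a first step: splitting a citation key into an identifier type and a value before any metadata is looked up. This is an illustrative sketch, not Manubot's implementation. The `parse_citekey` function, the `CiteKey` type, and the set of accepted prefixes are assumptions introduced here, and metadata retrieval and style conversion are deliberately left out.

```python
from typing import NamedTuple


class CiteKey(NamedTuple):
    """A parsed citation key (hypothetical type for this sketch)."""
    source: str      # identifier type, e.g. "doi"
    identifier: str  # the persistent identifier itself


# Assumed prefixes for this sketch, mirroring the identifier kinds named in
# the abstract (DOIs, PubMed IDs, ISBNs, URLs).
KNOWN_SOURCES = {"doi", "pmid", "isbn", "url"}


def parse_citekey(citekey: str) -> CiteKey:
    """Split 'prefix:identifier' on the first colon and validate the prefix."""
    source, separator, identifier = citekey.partition(":")
    source = source.lower()
    if not separator or not identifier or source not in KNOWN_SOURCES:
        raise ValueError(f"unrecognized citation key: {citekey!r}")
    return CiteKey(source, identifier)


# Example keys; the DOI and PubMed ID are those of the record above.
for key in ["doi:10.1371/journal.pcbi.1007128", "pmid:31233491", "url:https://manubot.org"]:
    print(parse_citekey(key))
```

Splitting on the first colon only keeps URLs intact, since the identifier part may itself contain colons.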

Subject(s)

Publishing , Software , Writing , Humans , Manuscripts, Medical as Topic

14.

Opportunities and obstacles for deep learning in biology and medicine.

Ching, Travers; Himmelstein, Daniel S; Beaulieu-Jones, Brett K; Kalinin, Alexandr A; Do, Brian T; Way, Gregory P; Ferrero, Enrico; Agapow, Paul-Michael; Zietz, Michael; Hoffman, Michael M; Xie, Wei; Rosen, Gail L; Lengerich, Benjamin J; Israeli, Johnny; Lanchantin, Jack; Woloszynek, Stephen; Carpenter, Anne E; Shrikumar, Avanti; Xu, Jinbo; Cofer, Evan M; Lavender, Christopher A; Turaga, Srinivas C; Alexandari, Amr M; Lu, Zhiyong; Harris, David J; DeCaprio, Dave; Qi, Yanjun; Kundaje, Anshul; Peng, Yifan; Wiley, Laura K; Segler, Marwin H S; Boca, Simina M; Swamidass, S Joshua; Huang, Austin; Gitter, Anthony; Greene, Casey S.

J R Soc Interface ; 15(141), 2018 04.

Article in English | MEDLINE | ID: mdl-29618526

ABSTRACT

Deep learning describes a class of machine learning algorithms that are capable of combining raw inputs into layers of intermediate features. These algorithms have recently shown impressive results across a variety of domains. Biology and medicine are data-rich disciplines, but the data are complex and often ill-understood. Hence, deep learning techniques may be particularly well suited to solve problems of these fields. We examine applications of deep learning to a variety of biomedical problems (patient classification, fundamental biological processes, and treatment of patients) and discuss whether deep learning will be able to transform these tasks or if the biomedical sphere poses unique challenges. Following from an extensive literature review, we find that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art. Even though improvements over previous baselines have been modest in general, the recent progress indicates that deep learning methods will provide valuable means for speeding up or aiding human investigation. Though progress has been made linking a specific neural network's prediction to input features, understanding how users should interpret these models to make testable hypotheses about the system under study remains an open challenge. Furthermore, the limited amount of labelled data for training presents problems in some domains, as do legal and privacy constraints on work with sensitive health records. Nonetheless, we foresee deep learning enabling changes at both bench and bedside with the potential to transform several areas of biology and medicine.

Subject(s)

Biomedical Research/trends , Biomedical Technology/trends , Deep Learning/trends , Algorithms , Biomedical Research/methods , Decision Making , Delivery of Health Care/methods , Delivery of Health Care/trends , Disease/genetics , Drug Design , Electronic Health Records/trends , Humans , Terminology as Topic

15.

Sci-Hub provides access to nearly all scholarly literature.

Himmelstein, Daniel S; Romero, Ariel Rodriguez; Levernier, Jacob G; Munro, Thomas Anthony; McLaughlin, Stephen Reid; Greshake Tzovaras, Bastian; Greene, Casey S.

Elife ; 7, 2018 03 01.

Article in English | MEDLINE | ID: mdl-29424689

ABSTRACT

The website Sci-Hub enables users to download PDF versions of scholarly articles, including many articles that are paywalled at their journal's site. Sci-Hub has grown rapidly since its creation in 2011, but the extent of its coverage has been unclear. Here we report that, as of March 2017, Sci-Hub's database contains 68.9% of the 81.6 million scholarly articles registered with Crossref and 85.1% of articles published in toll access journals. We find that coverage varies by discipline and publisher, and that Sci-Hub preferentially covers popular, paywalled content. For toll access articles, we find that Sci-Hub provides greater coverage than the University of Pennsylvania, a major research university in the United States. Green open access to toll access articles via licit services, on the other hand, remains quite limited. Our interactive browser at https://greenelab.github.io/scihub allows users to explore these findings in more detail. For the first time, nearly all scholarly literature is available gratis to anyone with an Internet connection, suggesting the toll access business model may become unsustainable.

Subject(s)

Access to Information , Databases, Bibliographic , Scholarly Communication , Bibliometrics , Internet , Pennsylvania

16.

Precision annotation of digital samples in NCBI's gene expression omnibus.

Hadley, Dexter; Pan, James; El-Sayed, Osama; Aljabban, Jihad; Aljabban, Imad; Azad, Tej D; Hadied, Mohamad O; Raza, Shuaib; Rayikanti, Benjamin Abhishek; Chen, Bin; Paik, Hyojung; Aran, Dvir; Spatz, Jordan; Himmelstein, Daniel; Panahiazar, Maryam; Bhattacharya, Sanchita; Sirota, Marina; Musen, Mark A; Butte, Atul J.

Sci Data ; 4: 170125, 2017 09 19.

Article in English | MEDLINE | ID: mdl-28925997

ABSTRACT

The Gene Expression Omnibus (GEO) contains more than two million digital samples from functional genomics experiments amassed over almost two decades. However, individual sample metadata remains poorly described by unstructured free-text attributes, preventing its large-scale reanalysis. We introduce the Search Tag Analyze Resource for GEO as a web application (http://STARGEO.org) to curate better annotations of sample phenotypes uniformly across different studies, and to use these sample annotations to define robust genomic signatures of disease pathology by meta-analysis. In this paper, we target a small group of biomedical graduate students to show rapid crowd-curation of precise sample annotations across all phenotypes, and we demonstrate the biological validity of these crowd-curated annotations for breast cancer. STARGEO.org makes GEO data findable, accessible, interoperable and reusable (i.e., FAIR) to ultimately facilitate knowledge discovery. Our work demonstrates the utility of crowd-curation and interpretation of open 'big data' under FAIR principles as a first step towards realizing an ideal paradigm of precision medicine.

Subject(s)

Data Curation , Databases, Genetic , Gene Expression , Humans

17.

Systematic integration of biomedical knowledge prioritizes drugs for repurposing.

Himmelstein, Daniel Scott; Lizee, Antoine; Hessler, Christine; Brueggeman, Leo; Chen, Sabrina L; Hadley, Dexter; Green, Ari; Khankhanian, Pouya; Baranzini, Sergio E.

Elife ; 6, 2017 09 22.

Article in English | MEDLINE | ID: mdl-28936969

ABSTRACT

The ability to computationally predict whether a compound treats a disease would improve the economy and success rate of drug approval. This study describes Project Rephetio to systematically model drug efficacy based on 755 existing treatments. First, we constructed Hetionet (neo4j.het.io), an integrative network encoding knowledge from millions of biomedical studies. Hetionet v1.0 consists of 47,031 nodes of 11 types and 2,250,197 relationships of 24 types. Data were integrated from 29 public resources to connect compounds, diseases, genes, anatomies, pathways, biological processes, molecular functions, cellular components, pharmacologic classes, side effects, and symptoms. Next, we identified network patterns that distinguish treatments from non-treatments. Then, we predicted the probability of treatment for 209,168 compound-disease pairs (het.io/repurpose). Our predictions validated on two external sets of treatment and provided pharmacological insights on epilepsy, suggesting they will help prioritize drug repurposing candidates. This study was entirely open and received realtime feedback from 40 community members.

Subject(s)

Computational Biology/methods , Drug Discovery/methods , Drug Repositioning/methods , Systems Biology/methods , Humans , Models, Biological

18.

Association of HLA Genetic Risk Burden With Disease Phenotypes in Multiple Sclerosis.

Isobe, Noriko; Keshavan, Anisha; Gourraud, Pierre-Antoine; Zhu, Alyssa H; Datta, Esha; Schlaeger, Regina; Caillier, Stacy J; Santaniello, Adam; Lizée, Antoine; Himmelstein, Daniel S; Baranzini, Sergio E; Hollenbach, Jill; Cree, Bruce A C; Hauser, Stephen L; Oksenberg, Jorge R; Henry, Roland G.

JAMA Neurol ; 73(7): 795-802, 2016 07 01.

Article in English | MEDLINE | ID: mdl-27244296

ABSTRACT

IMPORTANCE: Although multiple HLA alleles associated with multiple sclerosis (MS) risk have been identified, genotype-phenotype studies in the HLA region remain scarce and inconclusive. OBJECTIVES: To investigate whether MS risk-associated HLA alleles also affect disease phenotypes. DESIGN, SETTING, AND PARTICIPANTS: A cross-sectional, case-control study comprising 652 patients with MS who had comprehensive phenotypic information and 455 individuals of European origin serving as controls was conducted at a single academic research site. Patients evaluated at the Multiple Sclerosis Center at University of California, San Francisco between July 2004 and September 2005 were invited to participate. Spinal cord imaging in the data set was acquired between July 2013 and March 2014; analysis was performed between December 2014 and December 2015. MAIN OUTCOMES AND MEASURES: Cumulative HLA genetic burden (HLAGB) calculated using the most updated MS-associated HLA alleles vs clinical and magnetic resonance imaging outcomes, including age at onset, disease severity, conversion time from clinically isolated syndrome to clinically definite MS, fractions of cortical and subcortical gray matter and cerebral white matter, brain lesion volume, spinal cord gray and white matter areas, upper cervical cord area, and the ratio of gray matter to the upper cervical cord area. Multivariate modeling was applied separately for each sex data set. RESULTS: Of the 652 patients with MS, 586 had no missing genetic data and were included in the HLAGB analysis. In these 586 patients (404 women [68.9%]; mean [SD] age at disease onset, 33.6 [9.4] years), HLAGB was higher than in controls (median [IQR], 0.7 [0-1.4] and 0 [-0.3 to 0.5], respectively; P = 1.8 × 10⁻²⁷). A total of 619 (95.8%) had relapsing-onset MS and 27 (4.2%) had progressive-onset MS. No significant difference was observed between relapsing-onset MS and primary progressive MS. A higher HLAGB was associated with younger age at onset and the atrophy of subcortical gray matter fraction in women with relapsing-onset MS (standard β = -1.20 × 10⁻¹; P = 1.7 × 10⁻² and standard β = -1.67 × 10⁻¹; P = 2.3 × 10⁻⁴, respectively), which were driven mainly by the HLA-DRB1*15:01 haplotype. In addition, we observed the distinct role of the HLA-A*24:02-B*07:02-DRB1*15:01 haplotype among the other common DRB1*15:01 haplotypes and a nominally protective effect of HLA-B*44:02 to the subcortical gray atrophy (standard β = -1.28 × 10⁻¹; P = 5.1 × 10⁻³ and standard β = 9.52 × 10⁻²; P = 3.6 × 10⁻², respectively). CONCLUSIONS AND RELEVANCE: We confirm and extend previous observations linking HLA MS susceptibility alleles with disease progression and specific clinical and magnetic resonance imaging phenotypic traits.

Subject(s)

Genetic Predisposition to Disease/genetics , Histocompatibility Antigens Class I/genetics , Multiple Sclerosis/genetics , Polymorphism, Single Nucleotide/genetics , Adult , Age of Onset , Alleles , Brain/diagnostic imaging , Brain/pathology , Case-Control Studies , Cross-Sectional Studies , Female , Genetic Association Studies , Humans , Imaging, Three-Dimensional , Male , Middle Aged , Multiple Sclerosis/diagnostic imaging , Multiple Sclerosis/physiopathology , Retrospective Studies , Spinal Cord/diagnostic imaging , Spinal Cord/pathology , White People , Young Adult

19.

Genetic Association-Guided Analysis of Gene Networks for the Study of Complex Traits.

Greene, Casey S; Himmelstein, Daniel S.

Circ Cardiovasc Genet ; 9(2): 179-84, 2016 Apr.

Article in English | MEDLINE | ID: mdl-27094199

Subject(s)

Gene Regulatory Networks , Genetic Association Studies , Quantitative Trait, Heritable , Confounding Factors, Epidemiologic , Genomics , Humans , Polymorphism, Single Nucleotide/genetics

20.

Meta-analysis of genome-wide association studies reveals genetic overlap between Hodgkin lymphoma and multiple sclerosis.

Khankhanian, Pouya; Cozen, Wendy; Himmelstein, Daniel S; Madireddy, Lohith; Din, Lennox; van den Berg, Anke; Matsushita, Takuya; Glaser, Sally L; Moré, Jayaji M; Smedby, Karin E; Baranzini, Sergio E; Mack, Thomas M; Lizée, Antoine; de Sanjosé, Silvia; Gourraud, Pierre-Antoine; Nieters, Alexandra; Hauser, Stephen L; Cocco, Pierluigi; Maynadié, Marc; Foretová, Lenka; Staines, Anthony; Delahaye-Sourdeix, Manon; Li, Dalin; Bhatia, Smita; Melbye, Mads; Onel, Kenan; Jarrett, Ruth; McKay, James D; Oksenberg, Jorge R; Hjalgrim, Henrik.

Int J Epidemiol ; 45(3): 728-40, 2016 06.

Article in English | MEDLINE | ID: mdl-26971321

ABSTRACT

BACKGROUND: Based on epidemiological commonalities, multiple sclerosis (MS) and Hodgkin lymphoma (HL), two clinically distinct conditions, have long been suspected to be aetiologically related. MS and HL occur in roughly the same age groups, both are associated with Epstein-Barr virus infection and ultraviolet (UV) light exposure, and they cluster mutually in families (though not in individuals). We speculated if in addition to sharing environmental risk factors, MS and HL were also genetically related. Using data from genome-wide association studies (GWAS) of 1816 HL patients, 9772 MS patients and 25 255 controls, we therefore investigated the genetic overlap between the two diseases. METHODS: From among a common denominator of 404 K single nucleotide polymorphisms (SNPs) studied, we identified SNPs and human leukocyte antigen (HLA) alleles independently associated with both diseases. Next, we assessed the cumulative genome-wide effect of MS-associated SNPs on HL and of HL-associated SNPs on MS. To provide an interpretational frame of reference, we used data from published GWAS to create a genetic network of diseases within which we analysed proximity of HL and MS to autoimmune diseases and haematological and non-haematological malignancies. RESULTS: SNP analyses revealed genome-wide overlap between HL and MS, most prominently in the HLA region. Polygenic HL risk scores explained 4.44% of HL risk (Nagelkerke R²), but also 2.36% of MS risk. Conversely, polygenic MS risk scores explained 8.08% of MS risk and 1.94% of HL risk. In the genetic disease network, HL was closer to autoimmune diseases than to solid cancers. CONCLUSIONS: HL displays considerable genetic overlap with MS and other autoimmune diseases.

Subject(s)

Genome-Wide Association Study , Hodgkin Disease/genetics , Multiple Sclerosis/genetics , Polymorphism, Single Nucleotide , Female , Gene Regulatory Networks , Genetic Predisposition to Disease , Humans , Linear Models , Male
