Search | VHL Regional Portal

1.

Evaluating FAIR Digital Object and Linked Data as distributed object systems.

Soiland-Reyes, Stian; Goble, Carole; Groth, Paul.

PeerJ Comput Sci ; 10: e1781, 2024.

Article in English | MEDLINE | ID: mdl-38855229

ABSTRACT

FAIR Digital Object (FDO) is an emerging concept that is highlighted by European Open Science Cloud (EOSC) as a potential candidate for building an ecosystem of machine-actionable research outputs. In this work we systematically evaluate FDO and its implementations as a global distributed object system, by using five different conceptual frameworks that cover interoperability, middleware, FAIR principles, EOSC requirements and FDO guidelines themself. We compare the FDO approach with established Linked Data practices and the existing Web architecture, and provide a brief history of the Semantic Web while discussing why these technologies may have been difficult to adopt for FDO purposes. We conclude with recommendations for both Linked Data and FDO communities to further their adaptation and alignment.

2.

BioBLP: a modular framework for learning on multimodal biomedical knowledge graphs.

Daza, Daniel; Alivanistos, Dimitrios; Mitra, Payal; Pijnenburg, Thom; Cochez, Michael; Groth, Paul.

J Biomed Semantics ; 14(1): 20, 2023 Dec 08.

Article in English | MEDLINE | ID: mdl-38066573

ABSTRACT

BACKGROUND: Knowledge graphs (KGs) are an important tool for representing complex relationships between entities in the biomedical domain. Several methods have been proposed for learning embeddings that can be used to predict new links in such graphs. Some methods ignore valuable attribute data associated with entities in biomedical KGs, such as protein sequences, or molecular graphs. Other works incorporate such data, but assume that entities can be represented with the same data modality. This is not always the case for biomedical KGs, where entities exhibit heterogeneous modalities that are central to their representation in the subject domain. OBJECTIVE: We aim to understand how to incorporate multimodal data into biomedical KG embeddings, and analyze the resulting performance in comparison with traditional methods. We propose a modular framework for learning embeddings in KGs with entity attributes, that allows encoding attribute data of different modalities while also supporting entities with missing attributes. We additionally propose an efficient pretraining strategy for reducing the required training runtime. We train models using a biomedical KG containing approximately 2 million triples, and evaluate the performance of the resulting entity embeddings on the tasks of link prediction, and drug-protein interaction prediction, comparing against methods that do not take attribute data into account. RESULTS: In the standard link prediction evaluation, the proposed method results in competitive, yet lower performance than baselines that do not use attribute data. When evaluated in the task of drug-protein interaction prediction, the method compares favorably with the baselines. Further analyses show that incorporating attribute data does outperform baselines over entities below a certain node degree, comprising approximately 75% of the diseases in the graph. We also observe that optimizing attribute encoders is a challenging task that increases optimization costs. Our proposed pretraining strategy yields significantly higher performance while reducing the required training runtime. CONCLUSION: BioBLP allows to investigate different ways of incorporating multimodal biomedical data for learning representations in KGs. With a particular implementation, we find that incorporating attribute data does not consistently outperform baselines, but improvements are obtained on a comparatively large subset of entities below a specific node-degree. Our results indicate a potential for improved performance in scientific discovery tasks where understudied areas of the KG would benefit from link prediction methods.

Subject(s)

Pattern Recognition, Automated

3.

Relational graph convolutional networks: a closer look.

Thanapalasingam, Thiviyan; van Berkel, Lucas; Bloem, Peter; Groth, Paul.

PeerJ Comput Sci ; 8: e1073, 2022.

Article in English | MEDLINE | ID: mdl-36426239

ABSTRACT

In this article, we describe a reproduction of the Relational Graph Convolutional Network (RGCN). Using our reproduction, we explain the intuition behind the model. Our reproduction results empirically validate the correctness of our implementations using benchmark Knowledge Graph datasets on node classification and link prediction tasks. Our explanation provides a friendly understanding of the different components of the RGCN for both users and researchers extending the RGCN approach. Furthermore, we introduce two new configurations of the RGCN that are more parameter efficient. The code and datasets are available at https://github.com/thiviyanT/torch-rgcn.

4.

Structure-based knowledge acquisition from electronic lab notebooks for research data provenance documentation.

Schröder, Max; Staehlke, Susanne; Groth, Paul; Nebe, J Barbara; Spors, Sascha; Krüger, Frank.

J Biomed Semantics ; 13(1): 4, 2022 01 31.

Article in English | MEDLINE | ID: mdl-35101121

ABSTRACT

BACKGROUND: Electronic Laboratory Notebooks (ELNs) are used to document experiments and investigations in the wet-lab. Protocols in ELNs contain a detailed description of the conducted steps including the necessary information to understand the procedure and the raised research data as well as to reproduce the research investigation. The purpose of this study is to investigate whether such ELN protocols can be used to create semantic documentation of the provenance of research data by the use of ontologies and linked data methodologies. METHODS: Based on an ELN protocol of a biomedical wet-lab experiment, a retrospective provenance model of the raised research data describing the details of the experiment in a machine-interpretable way is manually engineered. Furthermore, an automated approach for knowledge acquisition from ELN protocols is derived from these results. This structure-based approach exploits the structure in the experiment's description such as headings, tables, and links, to translate the ELN protocol into a semantic knowledge representation. To satisfy the Findable, Accessible, Interoperable, and Reuseable (FAIR) guiding principles, a ready-to-publish bundle is created that contains the research data together with their semantic documentation. RESULTS: While the manual modelling efforts serve as proof of concept by employing one protocol, the automated structure-based approach demonstrates the potential generalisation with seven ELN protocols. For each of those protocols, a ready-to-publish bundle is created and, by employing the SPARQL query language, it is illustrated that questions about the processes and the obtained research data can be answered. CONCLUSIONS: The semantic documentation of research data obtained from the ELN protocols allows for the representation of the retrospective provenance of research data in a machine-interpretable way. Research Object Crate (RO-Crate) bundles including these models enable researchers to easily share the research data including the corresponding documentation, but also to search and relate the experiment to each other.

Subject(s)

Documentation , Knowledge Bases , Documentation/methods , Electronics , Retrospective Studies , Semantic Web

5.

The non-linear impact of data handling on network diffusion models.

Nevin, James; Lees, Michael; Groth, Paul.

Patterns (N Y) ; 2(12): 100397, 2021 Dec 10.

Article in English | MEDLINE | ID: mdl-34950910

ABSTRACT

Many computational models rely on real-world data, and the steps required in moving from data collection, to data preparation, to model calibration, and input are becoming increasingly complex. Errors in data can lead to errors in model output that might invalidate conclusions in extreme cases. While the challenge of errors in data collection have been analyzed in the literature, here we highlight the importance of data handling in the modeling and simulation process, and how particular data handling errors can lead to errors in model output. We develop a framework for assessing the impact of potential data errors for models of spreading processes on networks, a broad class of models that capture many important real-world phenomena (e.g., epidemics, rumor spread, etc.). We focus on the susceptible-infected-removed (SIR) and Threshold models and examine how systematic errors in data handling impact the predicted spread of a virus (or information). Our results demonstrate that data handling errors can have significant impact on model conclusions especially in critical regions of a system.

6.

Perspectives on automated composition of workflows in the life sciences.

Lamprecht, Anna-Lena; Palmblad, Magnus; Ison, Jon; Schwämmle, Veit; Al Manir, Mohammad Sadnan; Altintas, Ilkay; Baker, Christopher J O; Ben Hadj Amor, Ammar; Capella-Gutierrez, Salvador; Charonyktakis, Paulos; Crusoe, Michael R; Gil, Yolanda; Goble, Carole; Griffin, Timothy J; Groth, Paul; Ienasescu, Hans; Jagtap, Pratik; Kalas, Matús; Kasalica, Vedran; Khanteymoori, Alireza; Kuhn, Tobias; Mei, Hailiang; Ménager, Hervé; Möller, Steffen; Richardson, Robin A; Robert, Vincent; Soiland-Reyes, Stian; Stevens, Robert; Szaniszlo, Szoke; Verberne, Suzan; Verhoeven, Aswin; Wolstencroft, Katherine.

F1000Res ; 10: 897, 2021.

Article in English | MEDLINE | ID: mdl-34804501

ABSTRACT

Scientific data analyses often combine several computational tools in automated pipelines, or workflows. Thousands of such workflows have been used in the life sciences, though their composition has remained a cumbersome manual process due to a lack of standards for annotation, assembly, and implementation. Recent technological advances have returned the long-standing vision of automated workflow composition into focus. This article summarizes a recent Lorentz Center workshop dedicated to automated composition of workflows in the life sciences. We survey previous initiatives to automate the composition process, and discuss the current state of the art and future perspectives. We start by drawing the "big picture" of the scientific workflow development life cycle, before surveying and discussing current methods, technologies and practices for semantic domain modelling, automation in workflow development, and workflow assessment. Finally, we derive a roadmap of individual and community-based actions to work toward the vision of automated workflow development in the forthcoming years. A central outcome of the workshop is a general description of the workflow life cycle in six stages: 1) scientific question or hypothesis, 2) conceptual workflow, 3) abstract workflow, 4) concrete workflow, 5) production workflow, and 6) scientific results. The transitions between stages are facilitated by diverse tools and methods, usually incorporating domain knowledge in some form. Formal semantic domain modelling is hard and often a bottleneck for the application of semantic technologies. However, life science communities have made considerable progress here in recent years and are continuously improving, renewing interest in the application of semantic technologies for workflow exploration, composition and instantiation. Combined with systematic benchmarking with reference data and large-scale deployment of production-stage workflows, such technologies enable a more systematic process of workflow development than we know today. We believe that this can lead to more robust, reusable, and sustainable workflows in the future.

Subject(s)

Biological Science Disciplines , Computational Biology , Benchmarking , Software , Workflow

7.

Dataset Reuse: Toward Translating Principles to Practice.

Koesten, Laura; Vougiouklis, Pavlos; Simperl, Elena; Groth, Paul.

Patterns (N Y) ; 1(8): 100136, 2020 Nov 13.

Article in English | MEDLINE | ID: mdl-33294873

ABSTRACT

The web provides access to millions of datasets that can have additional impact when used beyond their original context. We have little empirical insight into what makes a dataset more reusable than others and which of the existing guidelines and frameworks, if any, make a difference. In this paper, we explore potential reuse features through a literature review and present a case study on datasets on GitHub, a popular open platform for sharing code and data. We describe a corpus of more than 1.4 million data files, from over 65,000 repositories. Using GitHub's engagement metrics as proxies for dataset reuse, we relate them to reuse features from the literature and devise an initial model, using deep neural networks, to predict a dataset's reusability. This demonstrates the practical gap between principles and actionable insights that allow data publishers and tools designers to implement functionalities that provably facilitate reuse.

8.

Searching Data: A Review of Observational Data Retrieval Practices in Selected Disciplines.

Gregory, Kathleen; Groth, Paul; Cousijn, Helena; Scharnhorst, Andrea; Wyatt, Sally.

J Assoc Inf Sci Technol ; 70(5): 419-432, 2019 May.

Article in English | MEDLINE | ID: mdl-31763358

ABSTRACT

A cross-disciplinary examination of the user behaviors involved in seeking and evaluating data is surprisingly absent from the research data discussion. This review explores the data retrieval literature to identify commonalities in how users search for and evaluate observational research data in selected disciplines. Two analytical frameworks, rooted in information retrieval and science and technology studies, are used to identify key similarities in practices as a first step toward developing a model describing data retrieval.

9.

Addendum: The FAIR Guiding Principles for scientific data management and stewardship.

Wilkinson, Mark D; Dumontier, Michel; Jan Aalbersberg, Ijsbrand; Appleton, Gabrielle; Axton, Myles; Baak, Arie; Blomberg, Niklas; Boiten, Jan-Willem; da Silva Santos, Luiz Bonino; Bourne, Philip E; Bouwman, Jildau; Brookes, Anthony J; Clark, Tim; Crosas, Mercè; Dillo, Ingrid; Dumon, Olivier; Edmunds, Scott; Evelo, Chris T; Finkers, Richard; Gonzalez-Beltran, Alejandra; Gray, Alasdair J G; Groth, Paul; Goble, Carole; Grethe, Jeffrey S; Heringa, Jaap; Hoen, Peter A C 't; Hooft, Rob; Kuhn, Tobias; Kok, Ruben; Kok, Joost; Lusher, Scott J; Martone, Maryann E; Mons, Albert; Packer, Abel L; Persson, Bengt; Rocca-Serra, Philippe; Roos, Marco; van Schaik, Rene; Sansone, Susanna-Assunta; Schultes, Erik; Sengstag, Thierry; Slater, Ted; Strawn, George; Swertz, Morris A; Thompson, Mark; van der Lei, Johan; van Mulligen, Erik; Waagmeester, Andra; Wittenburg, Peter; Wolstencroft, Katherine.

Sci Data ; 6(1): 6, 2019 03 19.

Article in English | MEDLINE | ID: mdl-30890711

10.

Indicators for the use of robotic labs in basic biomedical research: a literature analysis.

Groth, Paul; Cox, Jessica.

PeerJ ; 5: e3997, 2017.

Article in English | MEDLINE | ID: mdl-29134146

ABSTRACT

Robotic labs, in which experiments are carried out entirely by robots, have the potential to provide a reproducible and transparent foundation for performing basic biomedical laboratory experiments. In this article, we investigate whether these labs could be applicable in current experimental practice. We do this by text mining 1,628 papers for occurrences of methods that are supported by commercial robotic labs. Using two different concept recognition tools, we find that 86%-89% of the papers have at least one of these methods. This and our other results provide indications that robotic labs can serve as the foundation for performing many lab-based experiments.

11.

The health care and life sciences community profile for dataset descriptions.

Dumontier, Michel; Gray, Alasdair J G; Marshall, M Scott; Alexiev, Vladimir; Ansell, Peter; Bader, Gary; Baran, Joachim; Bolleman, Jerven T; Callahan, Alison; Cruz-Toledo, José; Gaudet, Pascale; Gombocz, Erich A; Gonzalez-Beltran, Alejandra N; Groth, Paul; Haendel, Melissa; Ito, Maori; Jupp, Simon; Juty, Nick; Katayama, Toshiaki; Kobayashi, Norio; Krishnaswami, Kalpana; Laibe, Camille; Le Novère, Nicolas; Lin, Simon; Malone, James; Miller, Michael; Mungall, Christopher J; Rietveld, Laurens; Wimalaratne, Sarala M; Yamaguchi, Atsuko.

PeerJ ; 4: e2331, 2016.

Article in English | MEDLINE | ID: mdl-27602295

ABSTRACT

Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. Towards providing a practical guide for producing a high quality description of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting guideline covers elements of description, identification, attribution, versioning, provenance, and content summarization. This guideline reuses existing vocabularies, and is intended to meet key functional requirements including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine readable descriptions of versioned datasets.

12.

The FAIR Guiding Principles for scientific data management and stewardship.

Wilkinson, Mark D; Dumontier, Michel; Aalbersberg, I Jsbrand Jan; Appleton, Gabrielle; Axton, Myles; Baak, Arie; Blomberg, Niklas; Boiten, Jan-Willem; da Silva Santos, Luiz Bonino; Bourne, Philip E; Bouwman, Jildau; Brookes, Anthony J; Clark, Tim; Crosas, Mercè; Dillo, Ingrid; Dumon, Olivier; Edmunds, Scott; Evelo, Chris T; Finkers, Richard; Gonzalez-Beltran, Alejandra; Gray, Alasdair J G; Groth, Paul; Goble, Carole; Grethe, Jeffrey S; Heringa, Jaap; 't Hoen, Peter A C; Hooft, Rob; Kuhn, Tobias; Kok, Ruben; Kok, Joost; Lusher, Scott J; Martone, Maryann E; Mons, Albert; Packer, Abel L; Persson, Bengt; Rocca-Serra, Philippe; Roos, Marco; van Schaik, Rene; Sansone, Susanna-Assunta; Schultes, Erik; Sengstag, Thierry; Slater, Ted; Strawn, George; Swertz, Morris A; Thompson, Mark; van der Lei, Johan; van Mulligen, Erik; Velterop, Jan; Waagmeester, Andra; Wittenburg, Peter.

Sci Data ; 3: 160018, 2016 Mar 15.

Article in English | MEDLINE | ID: mdl-26978244

ABSTRACT

There is an urgent need to improve the infrastructure supporting the reuse of scholarly data. A diverse set of stakeholders-representing academia, industry, funding agencies, and scholarly publishers-have come together to design and jointly endorse a concise and measureable set of principles that we refer to as the FAIR Data Principles. The intent is that these may act as a guideline for those wishing to enhance the reusability of their data holdings. Distinct from peer initiatives that focus on the human scholar, the FAIR Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals. This Comment is the first formal publication of the FAIR Principles, and includes the rationale behind them, and some exemplar implementations in the community.

Subject(s)

Data Collection , Data Curation , Research Design , Database Management Systems , Guidelines as Topic , Reproducibility of Results

13.

Drug discovery FAQs: workflows for answering multidomain drug discovery questions.

Chichester, Christine; Digles, Daniela; Siebes, Ronald; Loizou, Antonis; Groth, Paul; Harland, Lee.

Drug Discov Today ; 20(4): 399-405, 2015 Apr.

Article in English | MEDLINE | ID: mdl-25463038

ABSTRACT

Modern data-driven drug discovery requires integrated resources to support decision-making and enable new discoveries. The Open PHACTS Discovery Platform (http://dev.openphacts.org) was built to address this requirement by focusing on drug discovery questions that are of high priority to the pharmaceutical industry. Although complex, most of these frequently asked questions (FAQs) revolve around the combination of data concerning compounds, targets, pathways and diseases. Computational drug discovery using workflow tools and the integrated resources of Open PHACTS can deliver answers to most of these questions. Here, we report on a selection of workflows used for solving these use cases and discuss some of the research challenges. The workflows are accessible online from myExperiment (http://www.myexperiment.org) and are available for reuse by the scientific community.

Subject(s)

Computational Biology , Databases, Chemical , Databases, Pharmaceutical , Decision Support Techniques , Drug Discovery/methods , Pharmaceutical Preparations/chemistry , Workflow , Access to Information , Data Mining , Humans , Molecular Structure , Signal Transduction/drug effects , Structure-Activity Relationship , Systems Integration

14.

Ten simple rules for the care and feeding of scientific data.

Goodman, Alyssa; Pepe, Alberto; Blocker, Alexander W; Borgman, Christine L; Cranmer, Kyle; Crosas, Merce; Di Stefano, Rosanne; Gil, Yolanda; Groth, Paul; Hedstrom, Margaret; Hogg, David W; Kashyap, Vinay; Mahabal, Ashish; Siemiginowska, Aneta; Slavkovic, Aleksandra.

PLoS Comput Biol ; 10(4): e1003542, 2014 Apr.

Article in English | MEDLINE | ID: mdl-24763340

Subject(s)

Data Interpretation, Statistical , Guidelines as Topic

15.

The altmetrics collection.

Priem, Jason; Groth, Paul; Taraborelli, Dario.

PLoS One ; 7(11): e48753, 2012.

Article in English | MEDLINE | ID: mdl-23133655

Subject(s)

Biomedical Research/trends , Publications , Humans , Internet , Social Media , Statistics as Topic

16.

Open PHACTS: semantic interoperability for drug discovery.

Williams, Antony J; Harland, Lee; Groth, Paul; Pettifer, Stephen; Chichester, Christine; Willighagen, Egon L; Evelo, Chris T; Blomberg, Niklas; Ecker, Gerhard; Goble, Carole; Mons, Barend.

Drug Discov Today ; 17(21-22): 1188-98, 2012 Nov.

Article in English | MEDLINE | ID: mdl-22683805

ABSTRACT

Open PHACTS is a public-private partnership between academia, publishers, small and medium sized enterprises and pharmaceutical companies. The goal of the project is to deliver and sustain an 'open pharmacological space' using and enhancing state-of-the-art semantic web standards and technologies. It is focused on practical and robust applications to solve specific questions in drug discovery research. OPS is intended to facilitate improvements in drug discovery in academia and industry and to support open innovation and in-house non-public drug discovery research. This paper lays out the challenges and how the Open PHACTS project is hoping to address these challenges technically and socially.

Subject(s)

Drug Discovery/organization & administration , Drug Industry/organization & administration , Public-Private Sector Partnerships/organization & administration , Drug Design , Humans , Information Storage and Retrieval/methods , Internet , Organizational Innovation , Research/organization & administration , Semantics

17.

The value of data.

Mons, Barend; van Haagen, Herman; Chichester, Christine; Hoen, Peter-Bram 't; den Dunnen, Johan T; van Ommen, Gertjan; van Mulligen, Erik; Singh, Bharat; Hooft, Rob; Roos, Marco; Hammond, Joel; Kiesel, Bruce; Giardine, Belinda; Velterop, Jan; Groth, Paul; Schultes, Erik.

Nat Genet ; 43(4): 281-3, 2011 Mar 29.

Article in English | MEDLINE | ID: mdl-21445068

ABSTRACT

Data citation and the derivation of semantic constructs directly from datasets have now both found their place in scientific communication. The social challenge facing us is to maintain the value of traditional narrative publications and their relationship to the datasets they report upon while at the same time developing appropriate metrics for citation of data and data constructs.

Subject(s)

Databases, Genetic , Communication , Genetic Variation , Humans , Knowledge Bases , Publishing

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

Subject(s)

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL