Search | VHL Regional Portal

1.

Synthetic data generation for a longitudinal cohort study - evaluation, method extension and reproduction of published data analysis results.

Kühnel, Lisa; Schneider, Julian; Perrar, Ines; Adams, Tim; Moazemi, Sobhan; Prasser, Fabian; Nöthlings, Ute; Fröhlich, Holger; Fluck, Juliane.

Sci Rep ; 14(1): 14412, 2024 06 22.

Article in English | MEDLINE | ID: mdl-38909025

ABSTRACT

Access to individual-level health data is essential for gaining new insights and advancing science. In particular, modern methods based on artificial intelligence rely on the availability of and access to large datasets. In the health sector, access to individual-level data is often challenging due to privacy concerns. A promising alternative is the generation of fully synthetic data, i.e., data generated through a randomised process that have similar statistical properties as the original data, but do not have a one-to-one correspondence with the original individual-level records. In this study, we use a state-of-the-art synthetic data generation method and perform in-depth quality analyses of the generated data for a specific use case in the field of nutrition. We demonstrate the need for careful analyses of synthetic data that go beyond descriptive statistics and provide valuable insights into how to realise the full potential of synthetic datasets. By extending the methods, but also by thoroughly analysing the effects of sampling from a trained model, we are able to largely reproduce significant real-world analysis results in the chosen use case.

Subject(s)

Data Analysis , Humans , Longitudinal Studies , Artificial Intelligence

2.

An Annotation Workbench for Semantic Annotation of Data Collection Instruments.

Sasse, Julia; Fluck, Juliane.

Stud Health Technol Inform ; 302: 108-112, 2023 May 18.

Article in English | MEDLINE | ID: mdl-37203619

ABSTRACT

Semantic interoperability, i.e., the ability to automatically interpret the shared information in a meaningful way, is one of the most important requirements for data analysis of different sources. In the area of clinical and epidemiological studies, the target of the National Research Data Infrastructure for Personal Health Data (NFDI4Health), interoperability of data collection instruments such as case report forms (CRFs), data dictionaries and questionnaires is critical. Retrospective integration of semantic codes into study metadata at item-level is important, as ongoing or completed studies contain valuable information, which should be preserved. We present a first version of a Metadata Annotation Workbench to support annotators in dealing with a variety of complex terminologies and ontologies. User-driven development with users from the fields of nutritional epidemiology and chronic diseases ensured that the service fulfills the basic requirements for a semantic metadata annotation software for these NFDI4Health use cases. The web application can be accessed using a web browser and the source code of the software is available with an open-source MIT license.

Subject(s)

Semantics , Software , Retrospective Studies , Web Browser , Metadata

3.

SCAview: an Intuitive Visual Approach to the Integrative Analysis of Clinical Data in Spinocerebellar Ataxias.

Uebachs, Mischa; Wegner, Philipp; Schaaf, Sebastian; Kugai, Simon; Jacobi, Heike; Kuo, Sheng-Han; Ashizawa, Tetsuo; Fluck, Juliane; Klockgether, Thomas; Faber, Jennifer.

Cerebellum ; 2023 Mar 31.

Article in English | MEDLINE | ID: mdl-37002505

ABSTRACT

With SCAview, we present a prompt and comprehensive tool that enables scientists to browse large datasets of the most common spinocerebellar ataxias intuitively and without technical effort. Basic concept is a visualization of data, with a graphical handling and filtering to select and define subgroups and their comparison. Several plot types to visualize all data points resulting from the selected attributes are provided. The underlying synthetic cohort is based on clinical data from five different European and US longitudinal multicenter cohorts in spinocerebellar ataxia type 1, 2, 3, and 6 (SCA1, 2, 3, and 6) comprising > 1400 patients with overall > 5500 visits. First, we developed a common data model to integrate the clinical, demographic, and characterizing data of each source cohort. Second, the available datasets from each cohort were mapped onto the data model. Third, we created a synthetic cohort based on the cleaned dataset. With SCAview, we demonstrate the feasibility of mapping cohort data from different sources onto a common data model. The resulting browser-based visualization tool with a thoroughly graphical handling of the data offers researchers the unique possibility to visualize relationships and distributions of clinical data, to define subgroups and to further investigate them without any technical effort. Access to SCAview can be requested via the Ataxia Global Initiative and is free of charge.

4.

Critical assessment of transformer-based AI models for German clinical notes.

Lentzen, Manuel; Madan, Sumit; Lage-Rupprecht, Vanessa; Kühnel, Lisa; Fluck, Juliane; Jacobs, Marc; Mittermaier, Mirja; Witzenrath, Martin; Brunecker, Peter; Hofmann-Apitius, Martin; Weber, Joachim; Fröhlich, Holger.

JAMIA Open ; 5(4): ooac087, 2022 Dec.

Article in English | MEDLINE | ID: mdl-36380848

ABSTRACT

Objective: Healthcare data such as clinical notes are primarily recorded in an unstructured manner. If adequately translated into structured data, they can be utilized for health economics and set the groundwork for better individualized patient care. To structure clinical notes, deep-learning methods, particularly transformer-based models like Bidirectional Encoder Representations from Transformers (BERT), have recently received much attention. Currently, biomedical applications are primarily focused on the English language. While general-purpose German-language models such as GermanBERT and GottBERT have been published, adaptations for biomedical data are unavailable. This study evaluated the suitability of existing and novel transformer-based models for the German biomedical and clinical domain. Materials and Methods: We used 8 transformer-based models and pre-trained 3 new models on a newly generated biomedical corpus, and systematically compared them with each other. We annotated a new dataset of clinical notes and used it with 4 other corpora (BRONCO150, CLEF eHealth 2019 Task 1, GGPONC, and JSynCC) to perform named entity recognition (NER) and document classification tasks. Results: General-purpose language models can be used effectively for biomedical and clinical natural language processing (NLP) tasks, still, our newly trained BioGottBERT model outperformed GottBERT on both clinical NER tasks. However, training new biomedical models from scratch proved ineffective. Discussion: The domain-adaptation strategy's potential is currently limited due to a lack of pre-training data. Since general-purpose language models are only marginally inferior to domain-specific models, both options are suitable for developing German-language biomedical applications. Conclusion: General-purpose language models perform remarkably well on biomedical and clinical NLP tasks. If larger corpora become available in the future, domain-adapting these models may improve performances.

5.

We are not ready yet: limitations of state-of-the-art disease named entity recognizers.

Kühnel, Lisa; Fluck, Juliane.

J Biomed Semantics ; 13(1): 26, 2022 10 27.

Article in English | MEDLINE | ID: mdl-36303237

ABSTRACT

BACKGROUND: Intense research has been done in the area of biomedical natural language processing. Since the breakthrough of transfer learning-based methods, BERT models are used in a variety of biomedical and clinical applications. For the available data sets, these models show excellent results - partly exceeding the inter-annotator agreements. However, biomedical named entity recognition applied on COVID-19 preprints shows a performance drop compared to the results on test data. The question arises how well trained models are able to predict on completely new data, i.e. to generalize. RESULTS: Based on the example of disease named entity recognition, we investigate the robustness of different machine learning-based methods - thereof transfer learning - and show that current state-of-the-art methods work well for a given training and the corresponding test set but experience a significant lack of generalization when applying to new data. CONCLUSIONS: We argue that there is a need for larger annotated data sets for training and testing. Therefore, we foresee the curation of further data sets and, moreover, the investigation of continual learning processes for machine learning-based models.

Subject(s)

COVID-19 , Data Mining , Humans , Data Mining/methods , Natural Language Processing , Machine Learning

6.

Continuous development of the semantic search engine preVIEW: from COVID-19 to long COVID.

Langnickel, Lisa; Darms, Johannes; Heldt, Katharina; Ducks, Denise; Fluck, Juliane.

Database (Oxford) ; 20222022 07 01.

Article in English | MEDLINE | ID: mdl-35776071

ABSTRACT

preVIEW is a freely available semantic search engine for Coronavirus disease (COVID-19)-related preprint publications. Currently, it contains >43 800 documents indexed with >4000 semantic concepts, annotated automatically. During the last 2 years, the dynamic situation of the corona crisis has demanded dynamic development. Whereas new semantic concepts have been added over time-such as the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants of interest-the service has been also extended with several features improving the usability and user friendliness. Most importantly, the user is now able to give feedback on detected semantic concepts, i.e. a user can mark annotations as true positives or false positives. In addition, we expanded our methods to construct search queries. The presented version of preVIEW also includes links to the peer-reviewed journal articles, if available. With the described system, we participated in the BioCreative VII interactive text-mining track and retrieved promising user-in-the-loop feedback. Additionally, as the occurrence of long-term symptoms after an infection with the virus SARS-CoV-2-called long COVID-is getting more and more attention, we have recently developed and incorporated a long COVID classifier based on state-of-the-art methods and manually curated data by experts. The service is freely accessible under https://preview.zbmed.de.

Subject(s)

COVID-19 , Search Engine , COVID-19/complications , COVID-19/epidemiology , Humans , SARS-CoV-2 , Semantics , Post-Acute COVID-19 Syndrome

7.

Integrative data semantics through a model-enabled data stewardship.

Wegner, Philipp; Schaaf, Sebastian; Uebachs, Mischa; Domingo-Fernández, Daniel; Salimi, Yasamin; Gebel, Stephan; Sargsyan, Astghik; Birkenbihl, Colin; Springstubbe, Stephan; Klockgether, Thomas; Fluck, Juliane; Hofmann-Apitius, Martin; Kodamullil, Alpha Tom.

Bioinformatics ; 38(15): 3850-3852, 2022 08 02.

Article in English | MEDLINE | ID: mdl-35652780

ABSTRACT

MOTIVATION: The importance of clinical data in understanding the pathophysiology of complex disorders has prompted the launch of multiple initiatives designed to generate patient-level data from various modalities. While these studies can reveal important findings relevant to the disease, each study captures different yet complementary aspects and modalities which, when combined, generate a more comprehensive picture of disease etiology. However, achieving this requires a global integration of data across studies, which proves to be challenging given the lack of interoperability of cohort datasets. RESULTS: Here, we present the Data Steward Tool (DST), an application that allows for semi-automatic semantic integration of clinical data into ontologies and global data models and data standards. We demonstrate the applicability of the tool in the field of dementia research by establishing a Clinical Data Model (CDM) in this domain. The CDM currently consists of 277 common variables covering demographics (e.g. age and gender), diagnostics, neuropsychological tests and biomarker measurements. The DST combined with this disease-specific data model shows how interoperability between multiple, heterogeneous dementia datasets can be achieved. AVAILABILITY AND IMPLEMENTATION: The DST source code and Docker images are respectively available at https://github.com/SCAI-BIO/data-steward and https://hub.docker.com/r/phwegner/data-steward. Furthermore, the DST is hosted at https://data-steward.bio.scai.fraunhofer.de/data-steward. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Dementia , Semantics , Humans , Software , Dementia/diagnosis

8.

Deep Learning-based detection of psychiatric attributes from German mental health records.

Madan, Sumit; Julius Zimmer, Fabian; Balabin, Helena; Schaaf, Sebastian; Fröhlich, Holger; Fluck, Juliane; Neuner, Irene; Mathiak, Klaus; Hofmann-Apitius, Martin; Sarkheil, Pegah.

Int J Med Inform ; 161: 104724, 2022 05.

Article in English | MEDLINE | ID: mdl-35279550

ABSTRACT

BACKGROUND: Health care records provide large amounts of data with real-world and longitudinal aspects, which is advantageous for predictive analyses and improvements in personalized medicine. Text-based records are a main source of information in mental health. Therefore, application of text mining to the electronic health records - especially mental state examination - is a key approach for detection of psychiatric disease phenotypes that relate to treatment outcomes. METHODS: We focused on the mental state examination (MSE) in the patients' discharge summaries as the key part of the psychiatric records. We prepared a sample of 150 text documents that we manually annotated for psychiatric attributes and symptoms. These documents were further divided into training and test sets. We designed and implemented a system to detect the psychiatric attributes automatically and linked the pathologically assessed attributes to AMDP terminology. This workflow uses a pre-trained neural network model, which is fine-tuned on the training set, and validated on the independent test set. Furthermore, a traditional NLP and rule-based component linked the recognized mentions to AMDP terminology. In a further step, we applied the system on a larger clinical dataset of 510 patients to extract their symptoms. RESULTS: The system identified the psychiatric attributes as well as their assessment (normal and pathological) and linked these entities to the AMDP terminology with an F1-score of 86% and 91% on an independent test set, respectively. CONCLUSION: The development of the current text mining system and the results highlight the feasibility of text mining methods applied to MSE in electronic mental health care reports. Our findings pave the way for the secondary use of routine data in the field of mental health, facilitating further clinical data analyses.

Subject(s)

Deep Learning , Mental Health , Data Mining/methods , Electronic Health Records , Humans , Natural Language Processing , Neural Networks, Computer

9.

Pre2Pub-Tracking the Path From Preprint to Journal Article: Algorithm Development and Validation.

Langnickel, Lisa; Podorskaja, Daria; Fluck, Juliane.

J Med Internet Res ; 24(4): e34072, 2022 04 08.

Article in English | MEDLINE | ID: mdl-35285808

ABSTRACT

BACKGROUND: The current COVID-19 crisis underscores the importance of preprints, as they allow for rapid communication of research results without delay in review. To fully integrate this type of publication into library information systems, we developed preview: a publicly available, central search engine for COVID-19-related preprints, which clearly distinguishes this source from peer-reviewed publications. The relationship between the preprint version and its corresponding journal version should be stored as metadata in both versions so that duplicates can be easily identified and information overload for researchers is reduced. OBJECTIVE: In this work, we investigated the extent to which the relationship information between preprint and corresponding journal publication is present in the published metadata, how it can be further completed, and how it can be used in preVIEW to identify already republished preprints and filter those duplicates in search results. METHODS: We first analyzed the information content available at the preprint servers themselves and the information that can be retrieved via Crossref. Moreover, we developed the algorithm Pre2Pub to find the corresponding reviewed article for each preprint. We integrated the results of those different resources into our search engine preVIEW, presented the information in the result set overview, and added filter options accordingly. RESULTS: Preprints have found their place in publication workflows; however, the link from a preprint to its corresponding journal publication is not completely covered in the metadata of the preprint servers or in Crossref. Our algorithm Pre2Pub is able to find approximately 16% more related journal articles with a precision of 99.27%. We also integrate this information in a transparent way within preVIEW so that researchers can use it in their search. CONCLUSIONS: Relationships between the preprint version and its journal version is valuable information that can help researchers finding only previously unknown information in preprints. As long as there is no transparent and complete way to store this relationship in metadata, the Pre2Pub algorithm is a suitable extension to retrieve this information.

Subject(s)

COVID-19 , Algorithms , Humans , Peer Review

10.

Improving the FAIRness of Health Studies in Germany: The German Central Health Study Hub COVID-19.

Darms, Johannes; Henke, Jörg; Hu, Xioaming; Schmidt, Carsten Oliver; Golebiewski, Martin; Fluck, Juliane.

Stud Health Technol Inform ; 287: 78-82, 2021 Nov 18.

Article in English | MEDLINE | ID: mdl-34795085

ABSTRACT

The German Central Health Study Hub COVID-19 is an online service that offers bundled access to COVID-19 related studies conducted in Germany. It combines metadata and other information of epidemiologic, public health and clinical studies into a single data repository for FAIR data access. In addition to study characteristics the system also allows easy access to study documents, as well as instruments for data collection. Study metadata and survey instruments are decomposed into individual data items and semantically enriched to ease the findability. Data from existing clinical trial registries (DRKS, clinicaltrails.gov and WHO ICTRP) are merged with epidemiological and public health studies manually collected and entered. More than 850 studies are listed as of September 2021.

Subject(s)

COVID-19 , Germany , Humans , Metadata , SARS-CoV-2 , Surveys and Questionnaires

11.

[Making COVID-19 research data more accessible-building a nationwide information infrastructure]. / COVID-19-Forschungsdaten leichter zugänglich machen Aufbau einer bundesweiten Informationsinfrastruktur.

Schmidt, Carsten Oliver; Fluck, Juliane; Golebiewski, Martin; Grabenhenrich, Linus; Hahn, Horst; Kirsten, Toralf; Klammt, Sebastian; Löbe, Matthias; Sax, Ulrich; Thun, Sylvia; Pigeot, Iris.

Bundesgesundheitsblatt Gesundheitsforschung Gesundheitsschutz ; 64(9): 1084-1092, 2021 Sep.

Article in German | MEDLINE | ID: mdl-34297162

ABSTRACT

Public health research and epidemiological and clinical studies are necessary to understand the COVID-19 pandemic and to take appropriate action. Therefore, since early 2020, numerous research projects have also been initiated in Germany. However, due to the large amount of information, it is currently difficult to get an overview of the diverse research activities and their results. Based on the "Federated research data infrastructure for personal health data" (NFDI4Health) initiative, the "COVID-19 task force" is able to create easier access to SARS-CoV-2- and COVID-19-related clinical, epidemiological, and public health research data. Therefore, the so-called FAIR data principles (findable, accessible, interoperable, reusable) are taken into account and should allow an expedited communication of results. The most essential work of the task force includes the generation of a study portal with metadata, selected instruments, other study documents, and study results as well as a search engine for preprint publications. Additional contents include a concept for the linkage between research and routine data, a service for an enhanced practice of image data, and the application of a standardized analysis routine for harmonized quality assessment. This infrastructure, currently being established, will facilitate the findability and handling of German COVID-19 research. The developments initiated in the context of the NFDI4Health COVID-19 task force are reusable for further research topics, as the challenges addressed are generic for the findability of and the handling with research data.

Subject(s)

Biomedical Research/trends , COVID-19 , Information Dissemination , Germany , Humans , Metadata , Pandemics , SARS-CoV-2

12.

Facilitating Study and Item Level Browsing for Clinical and Epidemiological COVID-19 Studies.

Schmidt, Carsten Oliver; Darms, Johannes; Shutsko, Aliaksandra; Löbe, Matthias; Nagrani, Rajini; Seifert, Bastian; Lindstädt, Birte; Golebiewski, Martin; Koleva, Sofiya; Bender, Theresa; Bauer, Christian Robert; Sax, Ulrich; Hu, Xiaoming; Lieser, Michael; Junker, Vivien; Klopfenstein, Sophie; Zeleke, Atinkut; Waltemath, Dagmar; Pigeot, Iris; Fluck, Juliane.

Stud Health Technol Inform ; 281: 794-798, 2021 May 27.

Article in English | MEDLINE | ID: mdl-34042687

ABSTRACT

COVID-19 poses a major challenge to individuals and societies around the world. Yet, it is difficult to obtain a good overview of studies across different medical fields of research such as clinical trials, epidemiology, and public health. Here, we describe a consensus metadata model to facilitate structured searches of COVID-19 studies and resources along with its implementation in three linked complementary web-based platforms. A relational database serves as central study metadata hub that secures compatibilities with common trials registries (e.g. ICTRP and standards like HL7 FHIR, CDISC ODM, and DataCite). The Central Search Hub was developed as a single-page application, the other two components with additional frontends are based on the SEEK platform and MICA, respectively. These platforms have different features concerning cohort browsing, item browsing, and access to documents and other study resources to meet divergent user needs. By this we want to promote transparent and harmonized COVID-19 research.

Subject(s)

COVID-19 , Epidemiologic Studies , Humans , Metadata , Registries , SARS-CoV-2

13.

COVID-19 preVIEW: Semantic Search to Explore COVID-19 Research Preprints.

Langnickel, Lisa; Baum, Roman; Darms, Johannes; Madan, Sumit; Fluck, Juliane.

Stud Health Technol Inform ; 281: 78-82, 2021 May 27.

Article in English | MEDLINE | ID: mdl-34042709

ABSTRACT

During the current COVID-19 pandemic, the rapid availability of profound information is crucial in order to derive information about diagnosis, disease trajectory, treatment or to adapt the rules of conduct in public. The increased importance of preprints for COVID-19 research initiated the design of the preprint search engine preVIEW. Conceptually, it is a lightweight semantic search engine focusing on easy inclusion of specialized COVID-19 textual collections and provides a user friendly web interface for semantic information retrieval. In order to support semantic search functionality, we integrated a text mining workflow for indexing with relevant terminologies. Currently, diseases, human genes and SARS-CoV-2 proteins are annotated, and more will be added in future. The system integrates collections from several different preprint servers that are used in the biomedical domain to publish non-peer-reviewed work, thereby enabling one central access point for the users. In addition, our service offers facet searching, export functionality and an API access. COVID-19 preVIEW is publicly available at https://preview.zbmed.de.

Subject(s)

COVID-19 , Humans , Pandemics , Publishing , SARS-CoV-2 , Semantics

14.

SEAweb: the small RNA Expression Atlas web application.

Rahman, Raza-Ur; Liebhoff, Anna-Maria; Bansal, Vikas; Fiosins, Maksims; Rajput, Ashish; Sattar, Abdul; Magruder, Daniel S; Madan, Sumit; Sun, Ting; Gautam, Abhivyakti; Heins, Sven; Liwinski, Timur; Bethune, Jörn; Trenkwalder, Claudia; Fluck, Juliane; Mollenhauer, Brit; Bonn, Stefan.

Nucleic Acids Res ; 48(D1): D204-D219, 2020 01 08.

Article in English | MEDLINE | ID: mdl-31598718

ABSTRACT

We present the Small RNA Expression Atlas (SEAweb), a web application that allows for the interactive querying, visualization and analysis of known and novel small RNAs across 10 organisms. It contains sRNA and pathogen expression information for over 4200 published samples with standardized search terms and ontologies. In addition, SEAweb allows for the interactive visualization and re-analysis of 879 differential expression and 514 classification comparisons. SEAweb's user model enables sRNA researchers to compare and re-analyze user-specific and published datasets, highlighting common and distinct sRNA expression patterns. We provide evidence for SEAweb's fidelity by (i) generating a set of 591 tissue specific miRNAs across 29 tissues, (ii) finding known and novel bacterial and viral infections across diseases and (iii) determining a Parkinson's disease-specific blood biomarker signature using novel data. We believe that SEAweb's simple semantic search interface, the flexible interactive reports and the user model with rich analysis capabilities will enable researchers to better understand the potential function and diagnostic value of sRNAs or pathogens across tissues, diseases and organisms.

Subject(s)

Databases, Nucleic Acid , RNA, Small Untranslated/metabolism , Animals , Bacterial Infections/microbiology , Cattle , Humans , Internet , Mice , Organ Specificity , Parkinson Disease/blood , RNA, Bacterial/metabolism , RNA, Viral/metabolism , Rats , Virus Diseases/virology

15.

The extraction of complex relationships and their conversion to biological expression language (BEL) overview of the BioCreative VI (2017) BEL track.

Madan, Sumit; Szostak, Justyna; Komandur Elayavilli, Ravikumar; Tsai, Richard Tzong-Han; Ali, Mehdi; Qian, Longhua; Rastegar-Mojarad, Majid; Hoeng, Julia; Fluck, Juliane.

Database (Oxford) ; 20192019 01 01.

Article in English | MEDLINE | ID: mdl-31603193

ABSTRACT

Knowledge of the molecular interactions of biological and chemical entities and their involvement in biological processes or clinical phenotypes is important for data interpretation. Unfortunately, this knowledge is mostly embedded in the literature in such a way that it is unavailable for automated data analysis procedures. Biological expression language (BEL) is a syntax representation allowing for the structured representation of a broad range of biological relationships. It is used in various situations to extract such knowledge and transform it into BEL networks. To support the tedious and time-intensive extraction work of curators with automated methods, we developed the BEL track within the framework of BioCreative Challenges. Within the BEL track, we provide training data and an evaluation environment to encourage the text mining community to tackle the automatic extraction of complex BEL relationships. In 2017 BioCreative VI, the 2015 BEL track was repeated with new test data. Although only minor improvements in text snippet retrieval for given statements were achieved during this second BEL task iteration, a significant increase of BEL statement extraction performance from provided sentences could be seen. The best performing system reached a 32% F-score for the extraction of complete BEL statements and with the given named entities this increased to 49%. This time, besides rule-based systems, new methods involving hierarchical sequence labeling and neural networks were applied for BEL statement extraction.

Subject(s)

Data Mining , Databases, Factual , Neural Networks, Computer , Vocabulary, Controlled

16.

The BEL information extraction workflow (BELIEF): evaluation in the BioCreative V BEL and IAT track.

Madan, Sumit; Hodapp, Sven; Senger, Philipp; Ansari, Sam; Szostak, Justyna; Hoeng, Julia; Peitsch, Manuel; Fluck, Juliane.

Database (Oxford) ; 20162016.

Article in English | MEDLINE | ID: mdl-27694210

ABSTRACT

Network-based approaches have become extremely important in systems biology to achieve a better understanding of biological mechanisms. For network representation, the Biological Expression Language (BEL) is well designed to collate findings from the scientific literature into biological network models. To facilitate encoding and biocuration of such findings in BEL, a BEL Information Extraction Workflow (BELIEF) was developed. BELIEF provides a web-based curation interface, the BELIEF Dashboard, that incorporates text mining techniques to support the biocurator in the generation of BEL networks. The underlying UIMA-based text mining pipeline (BELIEF Pipeline) uses several named entity recognition processes and relationship extraction methods to detect concepts and BEL relationships in literature. The BELIEF Dashboard allows easy curation of the automatically generated BEL statements and their context annotations. Resulting BEL statements and their context annotations can be syntactically and semantically verified to ensure consistency in the BEL network. In summary, the workflow supports experts in different stages of systems biology network building. Based on the BioCreative V BEL track evaluation, we show that the BELIEF Pipeline automatically extracts relationships with an F-score of 36.4% and fully correct statements can be obtained with an F-score of 30.8%. Participation in the BioCreative V Interactive task (IAT) track with BELIEF revealed a systems usability scale (SUS) of 67. Considering the complexity of the task for new users-learning BEL, working with a completely new interface, and performing complex curation-a score so close to the overall SUS average highlights the usability of BELIEF.Database URL: BELIEF is available at http://www.scaiview.com/belief/.

Subject(s)

Data Mining/methods , Machine Learning , Models, Biological , Programming Languages

17.

Overview of the interactive task in BioCreative V.

Wang, Qinghua; S Abdul, Shabbir; Almeida, Lara; Ananiadou, Sophia; Balderas-Martínez, Yalbi I; Batista-Navarro, Riza; Campos, David; Chilton, Lucy; Chou, Hui-Jou; Contreras, Gabriela; Cooper, Laurel; Dai, Hong-Jie; Ferrell, Barbra; Fluck, Juliane; Gama-Castro, Socorro; George, Nancy; Gkoutos, Georgios; Irin, Afroza K; Jensen, Lars J; Jimenez, Silvia; Jue, Toni R; Keseler, Ingrid; Madan, Sumit; Matos, Sérgio; McQuilton, Peter; Milacic, Marija; Mort, Matthew; Natarajan, Jeyakumar; Pafilis, Evangelos; Pereira, Emiliano; Rao, Shruti; Rinaldi, Fabio; Rothfels, Karen; Salgado, David; Silva, Raquel M; Singh, Onkar; Stefancsik, Raymund; Su, Chu-Hsien; Subramani, Suresh; Tadepally, Hamsa D; Tsaprouni, Loukia; Vasilevsky, Nicole; Wang, Xiaodong; Chatr-Aryamontri, Andrew; Laulederkind, Stanley J F; Matis-Mitchell, Sherri; McEntyre, Johanna; Orchard, Sandra; Pundir, Sangya; Rodriguez-Esteban, Raul.

Database (Oxford) ; 20162016.

Article in English | MEDLINE | ID: mdl-27589961

ABSTRACT

Fully automated text mining (TM) systems promote efficient literature searching, retrieval, and review but are not sufficient to produce ready-to-consume curated documents. These systems are not meant to replace biocurators, but instead to assist them in one or more literature curation steps. To do so, the user interface is an important aspect that needs to be considered for tool adoption. The BioCreative Interactive task (IAT) is a track designed for exploring user-system interactions, promoting development of useful TM tools, and providing a communication channel between the biocuration and the TM communities. In BioCreative V, the IAT track followed a format similar to previous interactive tracks, where the utility and usability of TM tools, as well as the generation of use cases, have been the focal points. The proposed curation tasks are user-centric and formally evaluated by biocurators. In BioCreative V IAT, seven TM systems and 43 biocurators participated. Two levels of user participation were offered to broaden curator involvement and obtain more feedback on usability aspects. The full level participation involved training on the system, curation of a set of documents with and without TM assistance, tracking of time-on-task, and completion of a user survey. The partial level participation was designed to focus on usability aspects of the interface and not the performance per se In this case, biocurators navigated the system by performing pre-designed tasks and then were asked whether they were able to achieve the task and the level of difficulty in completing the task. In this manuscript, we describe the development of the interactive task, from planning to execution and discuss major findings for the systems tested.Database URL: http://www.biocreative.org.

Subject(s)

Data Curation/methods , Data Mining/methods , Electronic Data Processing/methods

18.

Training and evaluation corpora for the extraction of causal relationships encoded in biological expression language (BEL).

Fluck, Juliane; Madan, Sumit; Ansari, Sam; Kodamullil, Alpha T; Karki, Reagon; Rastegar-Mojarad, Majid; Catlett, Natalie L; Hayes, William; Szostak, Justyna; Hoeng, Julia; Peitsch, Manuel.

Database (Oxford) ; 20162016.

Article in English | MEDLINE | ID: mdl-27554092

ABSTRACT

Success in extracting biological relationships is mainly dependent on the complexity of the task as well as the availability of high-quality training data. Here, we describe the new corpora in the systems biology modeling language BEL for training and testing biological relationship extraction systems that we prepared for the BioCreative V BEL track. BEL was designed to capture relationships not only between proteins or chemicals, but also complex events such as biological processes or disease states. A BEL nanopub is the smallest unit of information and represents a biological relationship with its provenance. In BEL relationships (called BEL statements), the entities are normalized to defined namespaces mainly derived from public repositories, such as sequence databases, MeSH or publicly available ontologies. In the BEL nanopubs, the BEL statements are associated with citation information and supportive evidence such as a text excerpt. To enable the training of extraction tools, we prepared BEL resources and made them available to the community. We selected a subset of these resources focusing on a reduced set of namespaces, namely, human and mouse genes, ChEBI chemicals, MeSH diseases and GO biological processes, as well as relationship types 'increases' and 'decreases'. The published training corpus contains 11 000 BEL statements from over 6000 supportive text excerpts. For method evaluation, we selected and re-annotated two smaller subcorpora containing 100 text excerpts. For this re-annotation, the inter-annotator agreement was measured by the BEL track evaluation environment and resulted in a maximal F-score of 91.18% for full statement agreement. In addition, for a set of 100 BEL statements, we do not only provide the gold standard expert annotations, but also text excerpts pre-selected by two automated systems. Those text excerpts were evaluated and manually annotated as true or false supportive in the course of the BioCreative V BEL track task.Database URL: http://wiki.openbel.org/display/BIOC/Datasets.

Subject(s)

Data Curation/methods , Data Mining/methods , Natural Language Processing , Animals , Humans , Mice

19.

BioCreative V track 4: a shared task for the extraction of causal network information using the Biological Expression Language.

Rinaldi, Fabio; Ellendorff, Tilia Renate; Madan, Sumit; Clematide, Simon; van der Lek, Adrian; Mevissen, Theo; Fluck, Juliane.

Database (Oxford) ; 20162016.

Article in English | MEDLINE | ID: mdl-27402677

ABSTRACT

Automatic extraction of biological network information is one of the most desired and most complex tasks in biological and medical text mining. Track 4 at BioCreative V attempts to approach this complexity using fragments of large-scale manually curated biological networks, represented in Biological Expression Language (BEL), as training and test data. BEL is an advanced knowledge representation format which has been designed to be both human readable and machine processable. The specific goal of track 4 was to evaluate text mining systems capable of automatically constructing BEL statements from given evidence text, and of retrieving evidence text for given BEL statements. Given the complexity of the task, we designed an evaluation methodology which gives credit to partially correct statements. We identified various levels of information expressed by BEL statements, such as entities, functions, relations, and introduced an evaluation framework which rewards systems capable of delivering useful BEL fragments at each of these levels. The aim of this evaluation method is to help identify the characteristics of the systems which, if combined, would be most useful for achieving the overall goal of automatically constructing causal biological networks from text.

Subject(s)

Data Mining/methods , Databases, Factual , Programming Languages , Humans

20.

Towards a Pathway Inventory of the Human Brain for Modeling Disease Mechanisms Underlying Neurodegeneration.

Iyappan, Anandhi; Gündel, Michaela; Shahid, Mohammad; Wang, Jiali; Li, Hui; Mevissen, Heinz-Theodor; Müller, Bernd; Fluck, Juliane; Jirsa, Viktor; Domide, Lia; Younesi, Erfan; Hofmann-Apitius, Martin.

J Alzheimers Dis ; 52(4): 1343-60, 2016 04 12.

Article in English | MEDLINE | ID: mdl-27079715

ABSTRACT

Molecular signaling pathways have been long used to demonstrate interactions among upstream causal molecules and downstream biological effects. They show the signal flow between cell compartments, the majority of which are represented as cartoons. These are often drawn manually by scanning through the literature, which is time-consuming, static, and non-interoperable. Moreover, these pathways are often devoid of context (condition and tissue) and biased toward certain disease conditions. Mining the scientific literature creates new possibilities to retrieve pathway information at higher contextual resolution and specificity. To address this challenge, we have created a pathway terminology system by combining signaling pathways and biological events to ensure a broad coverage of the entire pathway knowledge domain. This terminology was applied to mining biomedical papers and patents about neurodegenerative diseases with focus on Alzheimer's disease. We demonstrate the power of our approach by mapping literature-derived signaling pathways onto their corresponding anatomical regions in the human brain under healthy and Alzheimer's disease states. We demonstrate how this knowledge resource can be used to identify a putative mechanism explaining the mode-of-action of the approved drug Rasagiline, and show how this resource can be used for fingerprinting patents to support the discovery of pathway knowledge for Alzheimer's disease. Finally, we propose that based on next-generation cause-and-effect pathway models, a dedicated inventory of computer-processable pathway models specific to neurodegenerative diseases can be established, which hopefully accelerates context-specific enrichment analysis of experimental data with higher resolution and richer annotations.

Subject(s)

Brain/metabolism , Models, Neurological , Neurodegenerative Diseases/metabolism , Signal Transduction/physiology , Brain/drug effects , Brain/physiopathology , Databases, Factual , Humans , Metabolic Networks and Pathways/physiology , Neurodegenerative Diseases/physiopathology , Terminology as Topic

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL