Search | VHL Regional Portal

Assessing Privacy Vulnerabilities in Genetic Data Sets: Scoping Review.

Thomas, Mara; Mackes, Nuria; Preuss-Dodhy, Asad; Wieland, Thomas; Bundschus, Markus.

JMIR Bioinform Biotechnol ; 5: e54332, 2024 May 27.

Article in English | MEDLINE | ID: mdl-38935957

ABSTRACT

BACKGROUND: Genetic data are widely considered inherently identifiable. However, genetic data sets come in many shapes and sizes, and the feasibility of privacy attacks depends on their specific content. Assessing the reidentification risk of genetic data is complex, yet there is a lack of guidelines or recommendations that support data processors in performing such an evaluation. OBJECTIVE: This study aims to gain a comprehensive understanding of the privacy vulnerabilities of genetic data and create a summary that can guide data processors in assessing the privacy risk of genetic data sets. METHODS: We conducted a 2-step search, in which we first identified 21 reviews published between 2017 and 2023 on the topic of genomic privacy and then analyzed all references cited in the reviews (n=1645) to identify 42 unique original research studies that demonstrate a privacy attack on genetic data. We then evaluated the type and components of genetic data exploited for these attacks as well as the effort and resources needed for their implementation and their probability of success. RESULTS: From our literature review, we derived 9 nonmutually exclusive features of genetic data that are both inherent to any genetic data set and informative about privacy risk: biological modality, experimental assay, data format or level of processing, germline versus somatic variation content, content of single nucleotide polymorphisms, short tandem repeats, aggregated sample measures, structural variants, and rare single nucleotide variants. CONCLUSIONS: On the basis of our literature review, the evaluation of these 9 features covers the great majority of privacy-critical aspects of genetic data and thus provides a foundation and guidance for assessing genetic data risk.

Reflection of successful anticancer drug development processes in the literature.

Heinemann, Fabian; Huber, Torsten; Meisel, Christian; Bundschus, Markus; Leser, Ulf.

Drug Discov Today ; 21(11): 1740-1744, 2016 Nov.

Article in English | MEDLINE | ID: mdl-27443674

ABSTRACT

The development of cancer drugs is time-consuming and expensive. In particular, failures in late-stage clinical trials are a major cost driver for pharmaceutical companies. This puts a high demand on methods that provide insights into the success chances of new potential medicines. In this study, we systematically analyze publication patterns emerging along the drug discovery process of targeted cancer therapies, starting from basic research to drug approval - or failure. We find clear differences in the patterns of approved drugs compared with those that failed in Phase II/III. Feeding these features into a machine learning classifier allows us to predict the approval or failure of a targeted cancer drug significantly better than educated guessing. We believe that these findings could lead to novel measures for supporting decision making in drug development.

Subject(s)

Antineoplastic Agents , Drug Approval/statistics & numerical data , Drug Discovery , Publishing/statistics & numerical data , Biomedical Research , Machine Learning

Text mining patents for biomedical knowledge.

Rodriguez-Esteban, Raul; Bundschus, Markus.

Drug Discov Today ; 21(6): 997-1002, 2016 Jun.

Article in English | MEDLINE | ID: mdl-27179985

ABSTRACT

Biomedical text mining of scientific knowledge bases, such as Medline, has received much attention in recent years. Given that text mining is able to automatically extract biomedical facts that revolve around entities such as genes, proteins, and drugs, from unstructured text sources, it is seen as a major enabler to foster biomedical research and drug discovery. In contrast to the biomedical literature, research into the mining of biomedical patents has not reached the same level of maturity. Here, we review existing work and highlight the associated technical challenges that emerge from automatically extracting facts from patents. We conclude by outlining potential future directions in this domain that could help drive biomedical research and drug discovery.

Subject(s)

Data Mining , Patents as Topic , Biomedical Research , Drug Discovery

Automated Patent Categorization and Guided Patent Search using IPC as Inspired by MeSH and PubMed.

Eisinger, Daniel; Tsatsaronis, George; Bundschus, Markus; Wieneke, Ulrich; Schroeder, Michael.

J Biomed Semantics ; 4 Suppl 1: S3, 2013 Apr 15.

Article in English | MEDLINE | ID: mdl-23734562

ABSTRACT

Document search on PubMed, the pre-eminent database for biomedical literature, relies on the annotation of its documents with relevant terms from the Medical Subject Headings ontology (MeSH) for improving recall through query expansion. Patent documents are another important information source, though they are considerably less accessible. One option to expand patent search beyond pure keywords is the inclusion of classification information: Since every patent is assigned at least one class code, it should be possible for these assignments to be automatically used in a similar way as the MeSH annotations in PubMed. In order to develop a system for this task, it is necessary to have a good understanding of the properties of both classification systems. This report describes our comparative analysis of MeSH and the main patent classification system, the International Patent Classification (IPC). We investigate the hierarchical structures as well as the properties of the terms/classes respectively, and we compare the assignment of IPC codes to patents with the annotation of PubMed documents with MeSH terms. Our analysis shows a strong structural similarity of the hierarchies, but significant differences of terms and annotations. The low number of IPC class assignments and the lack of occurrences of class labels in patent texts imply that current patent search is severely limited. To overcome these limits, we evaluate a method for the automated assignment of additional classes to patent documents, and we propose a system for guided patent search based on the use of class co-occurrence information and external resources.

Gene-disease network analysis reveals functional modules in mendelian, complex and environmental diseases.

Bauer-Mehren, Anna; Bundschus, Markus; Rautschka, Michael; Mayer, Miguel A; Sanz, Ferran; Furlong, Laura I.

PLoS One ; 6(6): e20284, 2011.

Article in English | MEDLINE | ID: mdl-21695124

ABSTRACT

BACKGROUND: Scientists have been trying to understand the molecular mechanisms of diseases to design preventive and therapeutic strategies for a long time. For some diseases, it has become evident that it is not enough to obtain a catalogue of the disease-related genes but to uncover how disruptions of molecular networks in the cell give rise to disease phenotypes. Moreover, with the unprecedented wealth of information available, even obtaining such catalogue is extremely difficult. PRINCIPAL FINDINGS: We developed a comprehensive gene-disease association database by integrating associations from several sources that cover different biomedical aspects of diseases. In particular, we focus on the current knowledge of human genetic diseases including mendelian, complex and environmental diseases. To assess the concept of modularity of human diseases, we performed a systematic study of the emergent properties of human gene-disease networks by means of network topology and functional annotation analysis. The results indicate a highly shared genetic origin of human diseases and show that for most diseases, including mendelian, complex and environmental diseases, functional modules exist. Moreover, a core set of biological pathways is found to be associated with most human diseases. We obtained similar results when studying clusters of diseases, suggesting that related diseases might arise due to dysfunction of common biological processes in the cell. CONCLUSIONS: For the first time, we include mendelian, complex and environmental diseases in an integrated gene-disease association database and show that the concept of modularity applies for all of them. We furthermore provide a functional analysis of disease-related modules providing important new biological insights, which might not be discovered when considering each of the gene-disease association repositories independently. Hence, we present a suitable framework for the study of how genetic and environmental factors, such as drugs, contribute to diseases. AVAILABILITY: The gene-disease networks used in this study and part of the analysis are available at http://ibi.imim.es/DisGeNET/DisGeNETweb.html#Download.

Subject(s)

Disease/genetics , Environment , Gene Regulatory Networks/genetics , Cluster Analysis , Genetic Association Studies , Humans , Multigene Family/genetics , Phenotype

Extraction of semantic biomedical relations from text using conditional random fields.

Bundschus, Markus; Dejori, Mathaeus; Stetter, Martin; Tresp, Volker; Kriegel, Hans-Peter.

BMC Bioinformatics ; 9: 207, 2008 Apr 23.

Article in English | MEDLINE | ID: mdl-18433469

ABSTRACT

BACKGROUND: The increasing amount of published literature in biomedicine represents an immense source of knowledge, which can only efficiently be accessed by a new generation of automated information extraction tools. Named entity recognition of well-defined objects, such as genes or proteins, has achieved a sufficient level of maturity such that it can form the basis for the next step: the extraction of relations that exist between the recognized entities. Whereas most early work focused on the mere detection of relations, the classification of the type of relation is also of great importance and this is the focus of this work. In this paper we describe an approach that extracts both the existence of a relation and its type. Our work is based on Conditional Random Fields, which have been applied with much success to the task of named entity recognition. RESULTS: We benchmark our approach on two different tasks. The first task is the identification of semantic relations between diseases and treatments. The available data set consists of manually annotated PubMed abstracts. The second task is the identification of relations between genes and diseases from a set of concise phrases, so-called GeneRIF (Gene Reference Into Function) phrases. In our experimental setting, we do not assume that the entities are given, as is often the case in previous relation extraction work. Rather the extraction of the entities is solved as a subproblem. Compared with other state-of-the-art approaches, we achieve very competitive results on both data sets. To demonstrate the scalability of our solution, we apply our approach to the complete human GeneRIF database. The resulting gene-disease network contains 34758 semantic associations between 4939 genes and 1745 diseases. The gene-disease network is publicly available as a machine-readable RDF graph. 
CONCLUSION: We extend the framework of Conditional Random Fields towards the annotation of semantic relations from text and apply it to the biomedical domain. Our approach is based on a rich set of textual features and achieves a performance that is competitive to leading approaches. The model is quite general and can be extended to handle arbitrary biological entities and relation types. The resulting gene-disease network shows that the GeneRIF database provides a rich knowledge source for text mining. Current work is focused on improving the accuracy of detection of entities as well as entity boundaries, which will also greatly improve the relation extraction performance.

Subject(s)

Database Management Systems , Natural Language Processing , Biomedical Research/methods , Database Management Systems/standards , Database Management Systems/statistics & numerical data , Databases, Genetic , Disease/classification , Disease/etiology , Genes/physiology , Humans , MEDLINE , Models, Statistical , Semantics , Sequence Analysis , Terminology as Topic , Therapeutics/classification , Vocabulary, Controlled
