2.
Sci Data ; 11(1): 524, 2024 May 22.
Article in English | MEDLINE | ID: mdl-38778016

ABSTRACT

Datasets consist of measurement data and metadata. Metadata provides context, essential for understanding and (re-)using data. Various metadata standards exist for different methods, systems and contexts. However, relevant information resides at differing stages across the data-lifecycle. Often, this information is defined and standardized only at publication stage, which can lead to data loss and workload increase. In this study, we developed Metadatasheet, a metadata standard based on interviews with members of two biomedical consortia and systematic screening of data repositories. It aligns with the data-lifecycle allowing synchronous metadata recording within Microsoft Excel, a widespread data recording software. Additionally, we provide an implementation, the Metadata Workbook, that offers user-friendly features like automation, dynamic adaption, metadata integrity checks, and export options for various metadata standards. By design and due to its extensive documentation, the proposed metadata standard simplifies recording and structuring of metadata for biomedical scientists, promoting practicality and convenience in data management. This framework can accelerate scientific progress by enhancing collaboration and knowledge transfer throughout the intermediate steps of data creation.


Subject(s)
Data Management , Metadata , Biomedical Research , Data Management/standards , Metadata/standards , Software
6.
Lancet Digit Health ; 3(1): e51-e66, 2021 01.
Article in English | MEDLINE | ID: mdl-33735069

ABSTRACT

Health data that are publicly available are valuable resources for digital health research. Several public datasets containing ophthalmological imaging have been frequently used in machine learning research; however, the total number of datasets containing ophthalmological health information and their respective content is unclear. This Review aimed to identify all publicly available ophthalmological imaging datasets, detail their accessibility, describe which diseases and populations are represented, and report on the completeness of the associated metadata. With the use of MEDLINE, Google's search engine, and Google Dataset Search, we identified 94 open access datasets containing 507 724 images and 125 videos from 122 364 patients. Most datasets originated from Asia, North America, and Europe. Disease populations were unevenly represented, with glaucoma, diabetic retinopathy, and age-related macular degeneration disproportionately overrepresented in comparison with other eye diseases. The reporting of basic demographic characteristics such as age, sex, and ethnicity was poor, even at the aggregate level. This Review provides greater visibility for ophthalmological datasets that are publicly available as powerful resources for research. Our paper also exposes an increasing divide in the representation of different population and disease groups in health data repositories. The improved reporting of metadata would enable researchers to access the most appropriate datasets for their needs and maximise the potential of such resources.


Subject(s)
Databases, Factual , Datasets as Topic , Diagnostic Imaging/methods , Eye Diseases/diagnostic imaging , Ophthalmology , Humans , Metadata/standards
7.
Math Biosci ; 333: 108545, 2021 03.
Article in English | MEDLINE | ID: mdl-33460673

ABSTRACT

The SARS-CoV-2 virus has spread across the world, testing each nation's ability to understand the state of the pandemic in their country and control it. As we looked into the epidemiological data to uncover the impact of the COVID-19 pandemic, we discovered that critical metadata is missing which is meant to give context to epidemiological parameters. In this review, we identify key metadata for the COVID-19 fatality rate after a thorough analysis of mathematical models, serology-informed studies and determinants of causes of death for the COVID-19 pandemic. In doing so, we find reasons to establish a set of standard-based guidelines to record and report the data from epidemiological studies. Additionally, we discuss why standardizing nomenclature is a necessary component of these guidelines to improve communication and reproducibility. The goal of establishing these guidelines is to facilitate the interpretation of COVID-19 epidemiological findings and data by the general public, health officials, policymakers and fellow researchers. Our suggestions may not address all aspects of this issue; rather, they are meant to be the foundation on which experts can establish and encourage future guidelines throughout the appropriate communities.


Subject(s)
COVID-19/epidemiology , COVID-19/mortality , Health Communication/standards , Pandemics , SARS-CoV-2 , COVID-19 Serological Testing/statistics & numerical data , Epidemiology/standards , Epidemiology/statistics & numerical data , Epidemiology/trends , Humans , Mathematical Concepts , Metadata/standards , Models, Statistical , Public Health/standards , Public Health/statistics & numerical data , Public Health/trends , Reproducibility of Results , Risk Factors , Seroepidemiologic Studies , United States/epidemiology
8.
Nucleic Acids Res ; 49(D1): D743-D750, 2021 01 08.
Article in English | MEDLINE | ID: mdl-33221926

ABSTRACT

Metagenomics has become a standard strategy to comprehend the functional potential of microbial communities, including the human microbiome. Currently, the number of metagenomes in public repositories is increasing exponentially. The Sequence Read Archive (SRA) and the MG-RAST are the two main repositories for metagenomic data. These databases allow scientists to reanalyze samples and explore new hypotheses. However, mining samples from them can be a limiting factor, since the metadata available in these repositories is often misannotated, misleading, and decentralized, creating an overly complex environment for sample reanalysis. The main goal of the HumanMetagenomeDB is to simplify the identification and use of public human metagenomes of interest. HumanMetagenomeDB version 1.0 contains metadata of 69 822 metagenomes. We standardized 203 attributes, based on standardized ontologies, describing host characteristics (e.g. sex, age and body mass index), diagnosis information (e.g. cancer, Crohn's disease and Parkinson's disease), location (e.g. country, longitude and latitude), sampling site (e.g. gut, lung and skin) and sequencing attributes (e.g. sequencing platform, average length and sequence quality). Further, HumanMetagenomeDB version 1.0 metagenomes encompass 58 countries, 9 main sample sites (i.e. body parts), 58 diagnoses and multiple ages, ranging from newborn to 91 years old. The HumanMetagenomeDB is publicly available at https://webapp.ufz.de/hmgdb/.


Subject(s)
Data Curation , Databases, Genetic/standards , Metadata/standards , Metagenome , Humans , Metagenomics , Reference Standards , User-Computer Interface
11.
Trends Genet ; 36(6): 390-394, 2020 06.
Article in English | MEDLINE | ID: mdl-32396832

ABSTRACT

Although public repository requirements are aimed at researchers and designed to ensure that the utility of the limited data we have is optimized, these policies also have ramifications for research participants. In this opinion article, I discuss how the nature of such repositories can subject participants whose data are 'banked' to unwitting participation in scientific projects they might find objectionable. In addition, concerns about the privacy of banked genomic data are exacerbated by recent projects that demonstrate the ability to re-identify genomic data, raising the specter of discriminatory or oppressive use of this information. These concerns are most likely to discourage participation in research that requires data sharing among those who have experienced these phenomena and are less likely to discount their likelihood.


Subject(s)
Biological Variation, Population , Biomedical Research/standards , Databases, Genetic/standards , Genomics/standards , Information Dissemination/methods , Metadata/standards , Humans , Patient Selection , Privacy
12.
Br J Radiol ; 93(1109): 20190574, 2020 May 01.
Article in English | MEDLINE | ID: mdl-31971816

ABSTRACT

Healthcare is increasingly and routinely generating large volumes of data from different sources, which are difficult to handle and integrate. Confidence in data can be established through the knowledge that the data are validated, well-curated and with minimal bias or errors. As the National Measurement Institute of the UK, the National Physical Laboratory (NPL) is running an interdisciplinary project on digital health data curation. The project addresses one of the key challenges of the UK's Measurement Strategy, to provide confidence in the intelligent and effective use of data. A workshop was organised by NPL in which important stakeholders from NHS, industry and academia outlined the current and future challenges in healthcare data curation. This paper summarises the findings of the workshop and outlines NPL's views on how a metrological approach to the curation of healthcare data sets could help solve some of the important and emerging challenges of utilising healthcare data.


Subject(s)
Data Collection/methods , Medical Informatics/methods , Research Design/standards , Data Collection/standards , Diffusion of Innovation , Humans , Medical Informatics/standards , Metadata/standards , Telemedicine/methods , Telemedicine/standards , United Kingdom
14.
Molecules ; 24(8)2019 Apr 23.
Article in English | MEDLINE | ID: mdl-31018579

ABSTRACT

The Toxicology in the 21st Century (Tox21) project seeks to develop and test methods for high-throughput examination of the effect certain chemical compounds have on biological systems. Although primary and toxicity assay data were readily available for multiple reporter gene modified cell lines, extensive annotation and curation was required to improve these datasets with respect to how FAIR (Findable, Accessible, Interoperable, and Reusable) they are. In this study, we fully annotated the Tox21 published data with relevant and accepted controlled vocabularies. After removing unreliable data points, we aggregated the results and created three sets of signatures reflecting activity in the reporter gene assays, cytotoxicity, and selective reporter gene activity, respectively. We benchmarked these signatures using the chemical structures of the tested compounds and obtained generally high receiver operating characteristic (ROC) scores, suggesting good quality and utility of these signatures and the underlying data. We analyzed the results to identify promiscuous individual compounds and chemotypes for the three signature categories and interpreted the results to illustrate the utility and re-usability of the datasets. With this study, we aimed to demonstrate the importance of data standards in reporting screening results and high-quality annotations to enable re-use and interpretation of these data. To improve the data with respect to all FAIR criteria, all assay annotations, cleaned and aggregate datasets, and signatures were made available as standardized dataset packages (Aggregated Tox21 bioactivity data, 2019).


Subject(s)
Data Curation/statistics & numerical data , Gene Expression Regulation/drug effects , Metadata/standards , Pharmacogenetics/methods , Toxicology/methods , Xenobiotics/toxicity , Benchmarking , Datasets as Topic , Gene Expression Profiling , Genes, Reporter , High-Throughput Screening Assays/standards , Humans , Xenobiotics/chemistry , Xenobiotics/classification
15.
Sci Data ; 6: 190021, 2019 02 19.
Article in English | MEDLINE | ID: mdl-30778255

ABSTRACT

We present an analytical study of the quality of metadata about samples used in biomedical experiments. The metadata under analysis are stored in two well-known databases: BioSample, a repository managed by the National Center for Biotechnology Information (NCBI), and BioSamples, a repository managed by the European Bioinformatics Institute (EBI). We tested whether 11.4 M sample metadata records in the two repositories are populated with values that fulfill the stated requirements for such values. Our study revealed multiple anomalies in the metadata. Most metadata field names and their values are not standardized or controlled. Even simple binary or numeric fields are often populated with inadequate values of different data types. By clustering metadata field names, we discovered there are often many distinct ways to represent the same aspect of a sample. Overall, the metadata we analyzed reveal that there is a lack of principled mechanisms to enforce and validate metadata requirements. The significant aberrancies that we found in the metadata are likely to impede search and secondary use of the associated datasets.


Subject(s)
Biological Specimen Banks , Metadata/standards , Data Accuracy
16.
Int J Med Inform ; 121: 10-18, 2019 01.
Article in English | MEDLINE | ID: mdl-30545485

ABSTRACT

OBJECTIVE: Reproducibility of research studies is key to advancing biomedical science by building on sound results and reducing inconsistencies between published results and study data. We propose that the available data from research studies combined with provenance metadata provide a framework for evaluating scientific reproducibility. We developed the ProvCaRe platform to model, extract, and query semantic provenance information from 435,248 published articles. METHODS: The ProvCaRe platform consists of: (1) the S3 model and a formal ontology; (2) a provenance-focused text processing workflow to generate provenance triples consisting of subject, predicate, and object using metadata extracted from articles; and (3) the ProvCaRe knowledge repository that supports "provenance-aware" hypothesis-driven search queries. A new provenance-based ranking algorithm is used to rank the articles in the search query results. RESULTS: The ProvCaRe knowledge repository contains 48.9 million provenance triples. Seven research hypotheses were used as search queries for evaluation and the resulting provenance triples were analyzed using five categories of provenance terms. The highest number of terms (34%) described provenance related to population cohort followed by 29% of terms describing statistical data analysis methods, and only 5% of the terms described the measurement instruments used in a study. In addition, the analysis showed that some articles included a higher number of provenance terms across multiple provenance categories, suggesting a higher potential for reproducibility of these research studies. CONCLUSION: The ProvCaRe knowledge repository (https://provcare.case.edu/) is one of the largest provenance resources for biomedical research studies that combines intuitive search functionality with a new provenance-based ranking feature to list articles related to a search query.


Subject(s)
Algorithms , Biological Ontologies , Biomedical Research/standards , Metadata/standards , Semantics , Humans , Reproducibility of Results
17.
Sci Data ; 5: 180258, 2018 11 20.
Article in English | MEDLINE | ID: mdl-30457569

ABSTRACT

Clinical case reports (CCRs) provide an important means of sharing clinical experiences about atypical disease phenotypes and new therapies. However, published case reports contain largely unstructured and heterogeneous clinical data, posing a challenge to mining relevant information. Current indexing approaches generally concern document-level features and have not been specifically designed for CCRs. To address this disparity, we developed a standardized metadata template and identified text corresponding to medical concepts within 3,100 curated CCRs spanning 15 disease groups and more than 750 reports of rare diseases. We also prepared a subset of metadata on reports on selected mitochondrial diseases and assigned ICD-10 diagnostic codes to each. The resulting resource, Metadata Acquired from Clinical Case Reports (MACCRs), contains text associated with high-level clinical concepts, including demographics, disease presentation, treatments, and outcomes for each report. Our template and MACCR set render CCRs more findable, accessible, interoperable, and reusable (FAIR) while serving as valuable resources for key user groups, including researchers, physician investigators, clinicians, data scientists, and those shaping government policies for clinical trials.


Subject(s)
Clinical Studies as Topic , Data Curation , Metadata , Computational Biology , Data Analysis , Data Curation/methods , Data Curation/standards , Humans , Metadata/standards
18.
Anim Genet ; 49(6): 520-526, 2018 Dec.
Article in English | MEDLINE | ID: mdl-30311252

ABSTRACT

The Functional Annotation of ANimal Genomes (FAANG) project aims, through a coordinated international effort, to provide high quality functional annotation of animal genomes with an initial focus on farmed and companion animals. A key goal of the initiative is to ensure high quality and rich supporting metadata to describe the project's animals, specimens, cell cultures and experimental assays. By defining rich sample and experimental metadata standards and promoting best practices in data descriptions, deposition and openness, FAANG champions higher quality and reusability of published datasets. FAANG has established a Data Coordination Centre, which sits at the heart of the Metadata and Data Sharing Committee. It continues to evolve the metadata standards, support submissions and, crucially, create powerful and accessible tools to support deposition and validation of metadata. FAANG conforms to the findable, accessible, interoperable, and reusable (FAIR) data principles, with high quality, open access and functionally interlinked data. In addition to data generated by FAANG members and specific FAANG projects, existing datasets that meet the main (or more permissive legacy) standards are incorporated into a central, focused, functional data resource portal for the entire farmed and companion animal community. Through clear and effective metadata standards, validation and conversion software, combined with promotion of best practices in metadata implementation, FAANG aims to maximise effectiveness and inter-comparability of assay data. This supports the community to create a rich genome-to-phenotype resource and promotes continuing improvements in animal data standards as a whole.


Subject(s)
Data Curation/standards , Genomics , Metadata/standards , Animals , Livestock , Pets , Software
20.
Stud Health Technol Inform ; 247: 221-225, 2018.
Article in English | MEDLINE | ID: mdl-29677955

ABSTRACT

The establishment of a digital healthcare system is a national and community task. The Federal Ministry of Education and Research in Germany is providing funding for consortia consisting of university hospitals among others participating in the "Medical Informatics Initiative". Exchange of medical data between research institutions necessitates a place where meta information for this data is made accessible. Within these consortia different metadata registry solutions were chosen. To promote interoperability between these solutions, we have examined whether the portal of Medical Data Models is eligible for managing and communicating metadata and relevant information across different data integration centres of the Medical Informatics Initiative and beyond. Apart from the MDM-portal, some ISO 11179-based systems such as Samply.MDR as well as openEHR-based solutions are going to be applied. In this paper, we have focused on the creation of a mapping model between the CDISC ODM standard and the Samply.MDR import format. In summary, it can be stated that the mapping model is feasible and promotes the exchangeability between different metadata registry approaches.


Subject(s)
Biomedical Research , Metadata/standards , Registries , Germany , Humans , Reference Standards