1.
Database (Oxford) ; 2022, 2022 05 25.
Article in English | MEDLINE | ID: mdl-35616100

ABSTRACT

Despite progress in the development of standards for describing and exchanging scientific information, the lack of easy-to-use standards for mapping between different representations of the same or similar objects in different databases poses a major impediment to data integration and interoperability. Mappings often lack the metadata needed to be correctly interpreted and applied. For example, are two terms equivalent or merely related? Are they narrow or broad matches? Or are they associated in some other way? Such relationships between the mapped terms are often not documented, which leads to incorrect assumptions and makes them hard to use in scenarios that require a high degree of precision (such as diagnostics or risk prediction). Furthermore, the lack of descriptions of how mappings were done makes it hard to combine and reconcile mappings, particularly curated and automated ones. We have developed the Simple Standard for Sharing Ontological Mappings (SSSOM) which addresses these problems by: (i) Introducing a machine-readable and extensible vocabulary to describe metadata that makes imprecision, inaccuracy and incompleteness in mappings explicit. (ii) Defining an easy-to-use simple table-based format that can be integrated into existing data science pipelines without the need to parse or query ontologies, and that integrates seamlessly with Linked Data principles. (iii) Implementing open and community-driven collaborative workflows that are designed to evolve the standard continuously to address changing requirements and mapping practices. (iv) Providing reference tools and software libraries for working with the standard. In this paper, we present the SSSOM standard, describe several use cases in detail and survey some of the existing work on standardizing the exchange of mappings, with the goal of making mappings Findable, Accessible, Interoperable and Reusable (FAIR). The SSSOM specification can be found at http://w3id.org/sssom/spec. 
Database URL: http://w3id.org/sssom/spec.
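
The table-based format described in point (ii) can be read with ordinary tabular tooling. The sketch below is illustrative only: the column names (`subject_id`, `predicate_id`, `object_id`, `mapping_justification`, `confidence`) are assumed from our reading of the SSSOM specification, and the identifiers, rows, and helper functions are hypothetical, not taken from the paper or its reference tools.

```python
import csv

# Hypothetical SSSOM-style table; column names are assumed from the SSSOM spec,
# and every identifier below is made up for illustration.
SSSOM_TSV = (
    "# comment lines (e.g. embedded metadata) start with '#'\n"
    "subject_id\tpredicate_id\tobject_id\tmapping_justification\tconfidence\n"
    "EX:0001\tskos:exactMatch\tOTHER:9001\tsemapv:ManualMappingCuration\t0.95\n"
    "EX:0002\tskos:broadMatch\tOTHER:9002\tsemapv:LexicalMatching\t0.70\n"
    "EX:0003\tskos:exactMatch\tOTHER:9003\tsemapv:LexicalMatching\t0.60\n"
)


def load_mappings(text):
    """Parse tab-separated mapping rows, skipping '#' comment lines."""
    lines = [ln for ln in text.splitlines() if not ln.startswith("#")]
    return list(csv.DictReader(lines, delimiter="\t"))


def exact_matches(mappings, min_confidence=0.9):
    """Keep only mappings precise enough for high-precision use cases."""
    return [
        m for m in mappings
        if m["predicate_id"] == "skos:exactMatch"
        and float(m["confidence"]) >= min_confidence
    ]


mappings = load_mappings(SSSOM_TSV)
for m in exact_matches(mappings):
    print(m["subject_id"], "->", m["object_id"])
```

Because the predicate and confidence travel with each row, a consumer can filter out merely related or low-confidence mappings without parsing or querying an ontology.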


Subject(s)
Metadata , Semantic Web , Data Management , Databases, Factual , Workflow
2.
J Am Med Inform Assoc ; 28(3): 427-443, 2021 03 01.
Article in English | MEDLINE | ID: mdl-32805036

ABSTRACT

OBJECTIVE: Coronavirus disease 2019 (COVID-19) poses societal challenges that require expeditious data and knowledge sharing. Though organizational clinical data are abundant, these are largely inaccessible to outside researchers. Statistical, machine learning, and causal analyses are most successful with large-scale data beyond what is available in any given organization. Here, we introduce the National COVID Cohort Collaborative (N3C), an open science community focused on analyzing patient-level data from many centers. MATERIALS AND METHODS: The Clinical and Translational Science Award Program and scientific community created N3C to overcome technical, regulatory, policy, and governance barriers to sharing and harmonizing individual-level clinical data. We developed solutions to extract, aggregate, and harmonize data across organizations and data models, and created a secure data enclave to enable efficient, transparent, and reproducible collaborative analytics. RESULTS: Organized in inclusive workstreams, we created legal agreements and governance for organizations and researchers; data extraction scripts to identify and ingest positive, negative, and possible COVID-19 cases; a data quality assurance and harmonization pipeline to create a single harmonized dataset; population of the secure data enclave with data, machine learning, and statistical analytics tools; dissemination mechanisms; and a synthetic data pilot to democratize data access. CONCLUSIONS: The N3C has demonstrated that a multisite collaborative learning health network can overcome barriers to rapidly build a scalable infrastructure incorporating multiorganizational clinical data for COVID-19 analytics. We expect this effort to save lives by enabling rapid collaboration among clinicians, researchers, and data scientists to identify treatments and specialized care and thereby reduce the immediate and long-term impacts of COVID-19.


Subject(s)
COVID-19 , Data Science/organization & administration , Information Dissemination , Intersectoral Collaboration , Computer Security , Data Analysis , Ethics Committees, Research , Government Regulation , Humans , National Institutes of Health (U.S.) , United States
3.
medRxiv ; 2020 Oct 04.
Article in English | MEDLINE | ID: mdl-33024984

ABSTRACT

Importance: COVID-19 racial disparities have gained significant attention, yet little is known about how age distributions obscure racial-ethnic disparities in COVID-19 case fatality ratios (CFR). Objective: We filled this gap by assessing relevant data availability and quality across states, and in states with available data, investigating how racial-ethnic disparities in CFR changed after age adjustment. Design/Setting/Participants/Exposure: We conducted a landscape analysis as of July 1st, 2020 and developed a grading system to assess COVID-19 case and death data by age and race in 50 states and DC. In states where age- and race-specific data were available, we applied direct age standardization to compare CFR across race-ethnicities. We developed an online dashboard to automatically and continuously update our results. Main Outcome and Measure: Our main outcome was CFR (deaths per 100 confirmed cases). We examined CFR by age and race-ethnicities. Results: We found substantial variations in disaggregating and reporting case and death data across states. Only three states, California, Illinois and Ohio, had sufficient age- and race-ethnicity-disaggregation to allow the investigation of racial-ethnic disparities in CFR while controlling for age. In total, we analyzed 391,991 confirmed cases and 17,612 confirmed deaths. The crude CFRs varied from, e.g. 7.35% among Non-Hispanic (NH) White population to 1.39% among Hispanic population in Ohio. After age standardization, racial-ethnic differences in CFR narrowed, e.g. from 5.28% among NH White population to 3.79% among NH Asian population in Ohio, or an over one-fold difference. In addition, the ranking of race-ethnic-specific CFRs changed after age standardization. NH White population had the leading crude CFRs whereas NH Black and NH Asian populations had the leading and second leading age-adjusted CFRs respectively in two of the three states. The Hispanic population's age-adjusted CFRs were substantially higher than the crude CFRs. Sensitivity analysis did not change these results qualitatively. Conclusions and Relevance: The availability and quality of age- and race-ethnic-specific COVID-19 case and death data varied greatly across states. Age distributions in confirmed cases obscured racial-ethnic disparities in COVID-19 CFR. Age standardization narrows racial-ethnic disparities and changes ranking. Public COVID-19 data availability, quality, and harmonization need improvement to address racial disparities in this pandemic.
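
Direct age standardization, as named in the abstract, weights each group's age-specific CFR by a shared standard age distribution. The following is a minimal sketch with made-up counts for two hypothetical groups and two age bands; the abstract does not state which standard population the authors used, so pooling the two groups' cases is purely an assumption for illustration.

```python
def crude_cfr(cases, deaths):
    """Deaths per 100 confirmed cases, ignoring age structure."""
    return 100.0 * sum(deaths.values()) / sum(cases.values())


def age_standardized_cfr(cases, deaths, standard_cases):
    """Directly standardized CFR: age-specific CFRs weighted by a standard
    age distribution. Assumes every age band has at least one case."""
    total = sum(standard_cases.values())
    return sum(
        100.0 * deaths[age] / cases[age] * standard_cases[age] / total
        for age in standard_cases
    )


# Hypothetical counts (not from the study): group A skews older than group B.
group_a = {"cases": {"<65": 100, "65+": 100}, "deaths": {"<65": 1, "65+": 10}}
group_b = {"cases": {"<65": 180, "65+": 20}, "deaths": {"<65": 3, "65+": 3}}

# Assumed standard population: the pooled case counts of both groups.
standard = {
    age: group_a["cases"][age] + group_b["cases"][age]
    for age in group_a["cases"]
}

for name, g in (("A", group_a), ("B", group_b)):
    crude = crude_cfr(g["cases"], g["deaths"])
    adjusted = age_standardized_cfr(g["cases"], g["deaths"], standard)
    print(f"group {name}: crude {crude:.2f}%, age-standardized {adjusted:.2f}%")
```

In this toy data group A has the higher crude CFR but the lower age-standardized CFR, which is the kind of ranking change the abstract reports.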

4.
Methods ; 129: 8-17, 2017 10 01.
Article in English | MEDLINE | ID: mdl-28454776

ABSTRACT

Recent years have witnessed unprecedented accumulation of DNA sequences and therefore protein sequences (predicted from DNA sequences), due to the advances of sequencing technology. One of the major sources of hypothetical proteins is metagenomics research. Current annotation of metagenomes (collections of short metagenomic sequences or assemblies) relies on similarity searches against known gene/protein families, based on which functional profiles of microbial communities can be built. This practice, however, leaves out the hypothetical proteins, which may outnumber the known proteins for many microbial communities. On the other hand, we may ask: what can we gain from the large number of metagenomes made available by the metagenomic studies, for the annotation of metagenomic sequences as well as functional annotation of hypothetical proteins in general? Here we propose a community profiling approach for predicting functional associations between proteins: two proteins are predicted to be associated if they share similar presence and absence profiles (called community profiles) across microbial communities. Community profiling is conceptually similar to the phylogenetic profiling approach to functional prediction, but with fundamental differences. We tested different profile construction methods, the selection of reference metagenomes, and correlation metrics, among others, to optimize the performance of this new approach. We demonstrated that the community profiling approach alone slightly outperforms the phylogenetic profiling approach for associating proteins in species that are well represented by sequenced genomes, and combining phylogenetic and community profiling further improves (though only marginally) the prediction of functional association. Further, we showed that the community profiling method significantly outperforms phylogenetic profiling, revealing more functional associations, when applied to a more recently sequenced bacterial genome.
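
The core idea of community profiling, pairing proteins whose presence/absence profiles across communities are similar, can be sketched as below. The profiles, protein names, Pearson correlation, and the 0.8 threshold are all hypothetical choices for illustration; the abstract says several correlation metrics were tested but does not say which one the authors settled on.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Hypothetical community profiles: 1 = protein family detected in that
# metagenome, 0 = not detected. One column per microbial community.
profiles = {
    "protA": [1, 1, 0, 0, 1, 0],
    "protB": [1, 1, 0, 0, 1, 0],
    "protC": [0, 0, 1, 1, 0, 1],
    "protD": [1, 0, 1, 0, 1, 0],
}


def predict_associations(profiles, threshold=0.8):
    """Return protein pairs whose profiles correlate at or above the threshold."""
    names = sorted(profiles)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(profiles[a], profiles[b])
            if r >= threshold:
                pairs.append((a, b, r))
    return pairs


for a, b, r in predict_associations(profiles):
    print(f"{a} ~ {b} (r = {r:.2f})")
```

Phylogenetic profiling would use the same comparison with one column per sequenced genome instead of one per community.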


Subject(s)
Metagenomics , Microbial Consortia/genetics , Sequence Analysis, DNA/methods , Software , Algorithms , Computational Biology/methods , Databases, Genetic , Genome, Bacterial , Phylogeny
5.
PLoS Comput Biol ; 9(3): e1002981, 2013.
Article in English | MEDLINE | ID: mdl-23555216

ABSTRACT

Shotgun metagenomics has been applied to the studies of the functionality of various microbial communities. As a critical analysis step in these studies, biological pathways are reconstructed based on the genes predicted from metagenomic shotgun sequences. Pathway reconstruction provides insights into the functionality of a microbial community and can be used for comparing multiple microbial communities. The utilization of pathway reconstruction, however, can be jeopardized because of imperfect functional annotation of genes, and ambiguity in the assignment of predicted enzymes to biochemical reactions (e.g., some enzymes are involved in multiple biochemical reactions). Considering that metabolic functions in a microbial community are carried out by many enzymes in a collaborative manner, we present a probabilistic sampling approach to profiling functional content in a metagenomic dataset, by sampling functions of catalytically promiscuous enzymes within the context of the entire metabolic network defined by the annotated metagenome. We test our approach on metagenomic datasets from environmental and human-associated microbial communities. The results show that our approach provides a more accurate representation of the metabolic activities encoded in a metagenome, and thus improves the comparative analysis of multiple microbial communities. In addition, our approach reports likelihood scores of putative reactions, which can be used to identify important reactions and metabolic pathways that reflect the environmental adaptation of the microbial communities. Source code for sampling metabolic networks is available online at http://omics.informatics.indiana.edu/mg/MetaNetSam/.


Subject(s)
Metabolic Networks and Pathways/genetics , Metagenome/genetics , Metagenomics/methods , Algorithms , Cluster Analysis , Databases, Genetic , Environmental Microbiology , Humans , Markov Chains
6.
BMC Bioinformatics ; 11: 255, 2010 May 17.
Article in English | MEDLINE | ID: mdl-20478034

ABSTRACT

BACKGROUND: Recently there has been an explosion of new data sources about genes, proteins, genetic variations, chemical compounds, diseases and drugs. Integration of these data sources and the identification of patterns that go across them is of critical interest. Initiatives such as Bio2RDF and LODD have tackled the problem of linking biological data and drug data respectively using RDF. Thus far, the inclusion of chemogenomic and systems chemical biology information that crosses the domains of chemistry and biology has been very limited. RESULTS: We have created a single repository called Chem2Bio2RDF by aggregating data from multiple chemogenomics repositories that is cross-linked into Bio2RDF and LODD. We have also created a linked-path generation tool to facilitate SPARQL query generation, and have created extended SPARQL functions to address specific chemical/biological search needs. We demonstrate the utility of Chem2Bio2RDF in investigating polypharmacology, identification of potential multiple pathway inhibitors, and the association of pathways with adverse drug reactions. CONCLUSIONS: We have created a new semantic systems chemical biology resource, and have demonstrated its potential usefulness in specific examples of polypharmacology, multiple pathway inhibition and adverse drug reaction--pathway mapping. We have also demonstrated the usefulness of extending SPARQL with cheminformatics and bioinformatics functionality.


Subject(s)
Data Mining/methods , Databases, Factual , Software , Systems Biology , Internet , Semantics , Systems Integration
7.
J Chem Inf Model ; 49(2): 263-9, 2009 Feb.
Article in English | MEDLINE | ID: mdl-19434828

ABSTRACT

This paper proposes a system that automatically extracts CYP protein and chemical interactions from journal article abstracts, using natural language processing (NLP) and text mining methods. In our system, we employ a maximum entropy based learning method, using results from syntactic, semantic, and lexical analysis of texts. We first present our system architecture and then discuss the data set for training our machine learning based models and the methods in building components in our system, such as part of speech (POS) tagging, Named Entity Recognition (NER), dependency parsing, and relation extraction. An evaluation of the system is conducted at the end, yielding very promising results: The POS, dependency parsing, and NER components in our system have achieved a very high level of accuracy as measured by precision, ranging from 85.9% to 98.5%, and the precision and the recall of the interaction extraction component are 76.0% and 82.6%, and for the overall system are 68.4% and 72.2%, respectively.
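
The abstract reports precision and recall separately. For readers who want a single figure, the harmonic mean (F1) can be derived from the reported pairs as sketched below; these F1 values are our own arithmetic and are not stated in the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, expressed as fractions.
reported = {
    "interaction extraction": (0.760, 0.826),
    "overall system": (0.684, 0.722),
}

for component, (precision, recall) in reported.items():
    print(f"{component}: F1 = {f1_score(precision, recall):.3f}")
```

This gives roughly 0.79 for the interaction extraction component and 0.70 for the overall system.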


Subject(s)
Information Storage and Retrieval , Programming Languages , Cytochrome P-450 Enzyme System/chemistry
8.
BMC Bioinformatics ; 8: 487, 2007 Dec 21.
Article in English | MEDLINE | ID: mdl-18154664

ABSTRACT

BACKGROUND: The web has seen an explosion of chemistry and biology related resources in the last 15 years: thousands of scientific journals, databases, wikis, blogs and resources are available with a wide variety of types of information. There is a huge need to aggregate and organise this information. However, the sheer number of resources makes it unrealistic to link them all in a centralised manner. Instead, search engines to find information in those resources flourish, and formal languages like Resource Description Framework and Web Ontology Language are increasingly used to allow linking of resources. A recent development is the use of userscripts to change the appearance of web pages, by on-the-fly modification of the web content. This opens possibilities to aggregate information and computational results from different web resources into the web page of one of those resources. RESULTS: Several userscripts are presented that enrich biology and chemistry related web resources by incorporating or linking to other computational or data sources on the web. The scripts make use of Greasemonkey-like plugins for web browsers and are written in JavaScript. Information from third-party resources is extracted using open Application Programming Interfaces, while common Universal Resource Locator schemes are used to make deep links to related information in that external resource. The userscripts presented here use a variety of techniques and resources, and show the potential of such scripts. CONCLUSION: This paper discusses a number of userscripts that aggregate information from two or more web resources. Examples are shown that enrich web pages with information from other resources, and show how information from web pages can be used to link to, search, and process information in other resources. Due to the nature of userscripts, scientists are able to select those scripts they find useful on a daily basis, as the scripts run directly in their own web browser rather than on the web server. This flexibility allows the scientists to tune the features of web resources to optimise their productivity.


Subject(s)
Biological Science Disciplines/education , Database Management Systems/organization & administration , Internet/organization & administration , Programming Languages , User-Computer Interface , Artificial Intelligence , Computer-Assisted Instruction/methods , Education, Distance/methods , Humans , Hypermedia , Information Services/organization & administration , Information Storage and Retrieval , Internet/statistics & numerical data , Medical Informatics/methods