Results 1 - 20 of 23
1.
Sci Data ; 9(1): 696, 2022 Nov 12.
Article in English | MEDLINE | ID: mdl-36371407

ABSTRACT

It is challenging to determine whether datasets are findable, accessible, interoperable, and reusable (FAIR) because the FAIR Guiding Principles refer to highly idiosyncratic criteria regarding the metadata used to annotate datasets. Specifically, the FAIR principles require metadata to be "rich" and to adhere to "domain-relevant" community standards. Scientific communities should be able to define their own machine-actionable templates for metadata that encode these "rich," discipline-specific elements. We have explored this template-based approach in the context of two software systems. One system is the CEDAR Workbench, which investigators use to author new metadata. The other is the FAIRware Workbench, which evaluates the metadata of archived datasets for their adherence to community standards. Benefits accrue when templates for metadata become central elements in an ecosystem of tools to manage online datasets, both because the templates serve as a community reference for what constitutes FAIR data and because they embody that perspective in a form that can be distributed among a variety of software applications to assist with data stewardship and data sharing.
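The template-based approach can be pictured with a small sketch. The field names and vocabularies below are invented for illustration; this is not the CEDAR template language or the FAIRware API:

```python
# Illustrative sketch: a community metadata template expressed as
# field -> allowed values, usable both when authoring new metadata
# and when checking archived records against the community standard.
template = {
    "organism": {"Homo sapiens", "Mus musculus"},
    "assay": {"RNA-seq", "ChIP-seq"},
}

def check_metadata(record: dict, template: dict) -> list:
    """Return fields that are missing or outside the template's controlled vocabulary."""
    return [field for field, allowed in template.items()
            if record.get(field) not in allowed]

# A record using a term outside the template fails on exactly that field.
problems = check_metadata({"organism": "Homo sapiens", "assay": "qPCR"}, template)
```

Because the template is itself data, the same object can drive an authoring form and an after-the-fact evaluation, which is the ecosystem role the abstract describes.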

2.
Front Big Data ; 5: 883341, 2022.
Article in English | MEDLINE | ID: mdl-35647536

ABSTRACT

Although all the technical components supporting fully orchestrated Digital Twins (DT) currently exist, what remains missing is a conceptual clarification and analysis of a more generalized concept of a DT that is made FAIR, that is, universally machine actionable. This methodological overview is a first step toward this clarification. We present a review of previously developed semantic artifacts and how they may be used to compose a higher-order data model referred to here as a FAIR Digital Twin (FDT). We propose an architectural design to compose, store and reuse FDTs supporting data intensive research, with emphasis on privacy by design and their use in GDPR compliant open science.

3.
Adv Genet (Hoboken) ; 2(2): e10050, 2021 Jun.
Article in English | MEDLINE | ID: mdl-34514430

ABSTRACT

The limited volume of COVID-19 data from Africa raises concerns for global genome research, which requires a diversity of genotypes for accurate disease prediction, including on the provenance of new SARS-CoV-2 mutations. The Virus Outbreak Data Network (VODAN)-Africa studied the possibility of increasing the production of clinical data, identifying concerns about data ownership and about the limited use of health data for quality treatment at the point of care. To address this, VODAN-Africa developed an architecture to record clinical health data and research data on the incidence of COVID-19, producing these as human- and machine-readable data objects in a distributed architecture of locally governed, linked data. This architecture supports analytics at the point of care and, through data visiting across facilities, generic analytics. An algorithm was run across FAIR Data Points to visit the distributed data and produce aggregate findings. The FAIR data architecture is deployed in Uganda, Ethiopia, Liberia, Nigeria, Kenya, Somalia, Tanzania, Zimbabwe, and Tunisia.

4.
PeerJ ; 8: e8871, 2020.
Article in English | MEDLINE | ID: mdl-32341891

ABSTRACT

The grammatical structures scholars use to express their assertions are intended to convey various degrees of certainty or speculation. Prior studies have suggested a variety of categorization systems for scholarly certainty; however, these have not been objectively tested for their validity, particularly with respect to representing the interpretation by the reader rather than the intention of the author. In this study, we use a series of questionnaires to determine how researchers classify various scholarly assertions, using three distinct certainty classification systems. We find that there are three distinct categories of certainty along a spectrum from high to low. We show that these categories can be detected in an automated manner using a machine learning model, with a cross-validation accuracy of 89.2% relative to an author-annotated corpus and 82.2% accuracy against a publicly annotated corpus. This finding provides an opportunity for contextual metadata related to certainty to be captured as part of text-mining pipelines, which currently miss these subtle linguistic cues. We provide an exemplar machine-accessible representation, a Nanopublication, where the certainty category is embedded as metadata in a formal, ontology-based manner within text-mined scholarly assertions.
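The kind of linguistic cue involved can be illustrated with a deliberately simplified sketch. This is a lexical heuristic, not the machine learning model the study trained, and the cue lists are invented for illustration:

```python
# Illustrative sketch: mapping a scholarly assertion onto a three-level
# certainty spectrum using hedging cues. The cue vocabularies are toy
# examples, not the study's annotated corpora or trained model.
HEDGES_LOW = {"may", "might", "could", "possibly", "speculate"}
HEDGES_MID = {"suggest", "suggests", "appears", "likely", "indicate", "indicates"}

def certainty_category(assertion: str) -> str:
    """Return 'low', 'moderate', or 'high' certainty based on hedging cues."""
    words = {w.strip(".,;").lower() for w in assertion.split()}
    if words & HEDGES_LOW:
        return "low"
    if words & HEDGES_MID:
        return "moderate"
    return "high"  # no hedging cue found: treat as a direct claim
```

A text-mining pipeline could attach the returned category as metadata to each extracted assertion, which is the role the abstract envisions for certainty annotations.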

6.
Sci Data ; 6(1): 174, 2019 Sep 20.
Article in English | MEDLINE | ID: mdl-31541130

ABSTRACT

Transparent evaluations of FAIRness are increasingly required by a wide range of stakeholders, from scientists to publishers, funding agencies, and policy makers. We propose a scalable, automatable framework to evaluate digital resources that encompasses measurable indicators, open source tools, and participation guidelines, which come together to accommodate domain-relevant, community-defined FAIR assessments. The components of the framework are: (1) Maturity Indicators, community-authored specifications that delimit a specific automatically-measurable FAIR behavior; (2) Compliance Tests, small Web apps that test digital resources against individual Maturity Indicators; and (3) the Evaluator, a Web application that registers, assembles, and applies community-relevant sets of Compliance Tests against a digital resource, and provides a detailed report about what a machine "sees" when it visits that resource. We discuss the technical and social considerations of FAIR assessments, and how these translate to our community-driven infrastructure. We then illustrate how the output of the Evaluator tool can serve as a roadmap to assist data stewards in incrementally and realistically improving the FAIRness of their resources.
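How the three components fit together can be sketched in miniature. The indicator names and metadata fields below are invented, and this is not the FAIR Evaluator's actual API:

```python
# Hedged sketch: Maturity Indicators implemented as Compliance Tests
# (predicates over resource metadata), and an Evaluator that applies a
# community-chosen set of them and reports what the machine "sees".
from typing import Callable, Dict

ComplianceTest = Callable[[dict], bool]  # metadata in, pass/fail out

def has_persistent_id(resource: dict) -> bool:
    # Illustrative Maturity Indicator: the resource has a resolvable DOI.
    return str(resource.get("identifier", "")).startswith("https://doi.org/")

def has_license(resource: dict) -> bool:
    # Illustrative Maturity Indicator: reuse conditions are stated explicitly.
    return bool(resource.get("license"))

def evaluate(resource: dict, tests: Dict[str, ComplianceTest]) -> dict:
    """Apply a set of Compliance Tests and report pass/fail per indicator."""
    return {name: test(resource) for name, test in tests.items()}

report = evaluate(
    {"identifier": "https://doi.org/10.1234/demo", "license": None},
    {"persistent_id": has_persistent_id, "license": has_license},
)
```

The per-indicator report is what makes the output usable as a roadmap: each failing test names a concrete, incremental improvement.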

9.
J Biomed Inform ; 71: 178-189, 2017 Jul.
Article in English | MEDLINE | ID: mdl-28579531

ABSTRACT

PROBLEM: Biomedical literature and databases contain important clues for the identification of potential disease biomarkers. However, searching these enormous knowledge reservoirs and integrating findings across heterogeneous sources is costly and difficult. Here we demonstrate how semantically integrated knowledge, extracted from biomedical literature and structured databases, can be used to automatically identify potential migraine biomarkers. METHOD: We used a knowledge graph containing more than 3.5 million biomedical concepts and 68.4 million relationships. Biochemical compound concepts were filtered and ranked by their potential as biomarkers based on their connections to a subgraph of migraine-related concepts. The ranked results were evaluated against the results of a systematic literature review that was performed manually by migraine researchers. Weight points were assigned to these reference compounds to indicate their relative importance. RESULTS: Ranked results automatically generated by the knowledge graph were highly consistent with results from the manual literature review. Out of 222 reference compounds, 163 (73%) ranked in the top 2000, accounting for 547 of the 644 (85%) weight points assigned to the reference compounds. For reference compounds that did not rank near the top of the list, an extensive error analysis was performed. When evaluating the overall performance, we obtained a ROC-AUC of 0.974. DISCUSSION: Semantic knowledge graphs composed of information integrated from multiple and varying sources can assist researchers in identifying potential disease biomarkers.
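The ranking step can be sketched with toy data. The concepts, edges, and weights below are invented for illustration and bear no relation to the authors' 3.5-million-concept graph or their actual scoring function:

```python
# Illustrative sketch: rank candidate compounds by the total weight of
# their edges into a disease-related subgraph of concepts.
disease_subgraph = {"migraine", "aura", "trigeminal"}

# Toy weighted edges: (concept_a, concept_b, weight).
edges = [
    ("serotonin", "migraine", 0.9),
    ("serotonin", "aura", 0.4),
    ("glucose", "migraine", 0.1),
    ("cholesterol", "liver", 0.8),  # no link into the subgraph
]

def rank_candidates(edges, subgraph):
    """Score each non-subgraph concept by its edge weight into the subgraph."""
    scores = {}
    for a, b, w in edges:
        for cand, other in ((a, b), (b, a)):
            if cand not in subgraph and other in subgraph:
                scores[cand] = scores.get(cand, 0.0) + w
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = rank_candidates(edges, disease_subgraph)
```

Concepts with no connection into the subgraph simply drop out, which is the filtering behavior the method section describes before ranking.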


Subject(s)
Biomarkers , Data Mining , Databases, Factual , Migraine Disorders/diagnosis , Semantics , Automation , Humans , Publications
10.
Sci Data ; 3: 160018, 2016 Mar 15.
Article in English | MEDLINE | ID: mdl-26978244

ABSTRACT

There is an urgent need to improve the infrastructure supporting the reuse of scholarly data. A diverse set of stakeholders, representing academia, industry, funding agencies, and scholarly publishers, have come together to design and jointly endorse a concise and measurable set of principles that we refer to as the FAIR Data Principles. The intent is that these may act as a guideline for those wishing to enhance the reusability of their data holdings. Distinct from peer initiatives that focus on the human scholar, the FAIR Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals. This Comment is the first formal publication of the FAIR Principles, and includes the rationale behind them and some exemplar implementations in the community.


Subject(s)
Data Collection , Data Curation , Research Design , Database Management Systems , Guidelines as Topic , Reproducibility of Results
11.
PLoS One ; 11(2): e0149621, 2016.
Article in English | MEDLINE | ID: mdl-26919047

ABSTRACT

High-throughput experimental methods such as medical sequencing and genome-wide association studies (GWAS) identify increasingly large numbers of potential relations between genetic variants and diseases. Both biological complexity (millions of potential gene-disease associations) and the accelerating rate of data production necessitate computational approaches to prioritize and rationalize potential gene-disease relations. Here, we use concept profile technology to expose from the biomedical literature both explicitly stated gene-disease relations (the explicitome) and a much larger set of implied gene-disease associations (the implicitome). Implicit relations are largely unknown to, or even unintended by, the original authors, but they vastly extend the reach of existing biomedical knowledge for the identification and interpretation of gene-disease associations. The implicitome can be used in conjunction with experimental data resources to rationalize both known and novel associations. We demonstrate the usefulness of the implicitome by rationalizing known and novel gene-disease associations, including those from GWAS. To facilitate the re-use of implicit gene-disease associations, we publish our data in compliance with FAIR Data Publishing recommendations [https://www.force11.org/group/fairgroup] using nanopublications. An online tool (http://knowledge.bio) is available to explore established and potential gene-disease associations in the context of other biomedical relations.


Subject(s)
Computational Biology/methods , Databases, Genetic , Genetic Predisposition to Disease , Genome-Wide Association Study , Humans
12.
J Biomed Semantics ; 6: 5, 2015.
Article in English | MEDLINE | ID: mdl-26464783

ABSTRACT

Data from high-throughput experiments often produce far more results than can ever appear in the main text or tables of a single research article. In these cases, the majority of new associations are archived either as supplemental information in an arbitrary format or in publisher-independent databases that can be difficult to find. These data are not only lost from scientific discourse but are also elusive to automated search, retrieval, and processing. Here, we use the nanopublication model to make the scientific assertions concluded from a workflow analysis of Huntington's disease data machine-readable, interoperable, and citable. We followed the nanopublication guidelines to semantically model our assertions as well as their provenance metadata and authorship. We demonstrate interoperability by linking nanopublication provenance to the Research Object model. These results indicate that nanopublications can give researchers an incentive to expose data in an interoperable, machine-readable form for future use and preservation, and to receive credit for their effort. Nanopublications can also play a leading role in hypothesis generation, offering opportunities for large-scale data integration.
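The nanopublication model's three named graphs can be sketched in miniature. The tuples, property names, and identifiers below are illustrative placeholders (triples as plain tuples rather than RDF), not the authors' actual data or ontology terms:

```python
# Hedged sketch of a nanopublication: one assertion plus two companion
# graphs, one for provenance and one for publication metadata.
def make_nanopub(assertion, author, derived_from):
    """Bundle an assertion with its provenance and publication info graphs."""
    return {
        "assertion": [assertion],
        "provenance": [("assertion", "prov:wasDerivedFrom", derived_from)],
        "pubinfo": [("nanopub", "dct:creator", author)],
    }

np_ = make_nanopub(
    ("gene:HTT", "associatedWith", "disease:Huntington"),
    "https://orcid.org/0000-0000-0000-0000",  # placeholder author identifier
    "workflow-analysis-run-1",                # placeholder provenance reference
)
```

Keeping provenance and authorship in their own graphs is what makes each assertion independently citable and creditable to its contributor.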

13.
Genome Biol ; 16: 22, 2015 Jan 05.
Article in English | MEDLINE | ID: mdl-25723102

ABSTRACT

The FANTOM5 project investigates transcription initiation activities in more than 1,000 human and mouse primary cells, cell lines and tissues using CAGE. Based on manual curation of sample information and development of an ontology for sample classification, we assemble the resulting data into a centralized data resource (http://fantom.gsc.riken.jp/5/). This resource contains web-based tools and data-access points for the research community to search and extract data related to samples, genes, promoter activities, transcription factors and enhancers across the FANTOM5 atlas.


Subject(s)
Genomics/methods , Promoter Regions, Genetic , Software , Transcription Initiation, Genetic , Animals , Computational Biology/methods , Databases, Genetic , Datasets as Topic , Gene Expression Profiling , Humans , Mice , Transcriptome , User-Computer Interface
14.
Nat Genet ; 47(2): 115-25, 2015 Feb.
Article in English | MEDLINE | ID: mdl-25581432

ABSTRACT

Many cancer-associated somatic copy number alterations (SCNAs) are known. Currently, one of the challenges is to identify the molecular downstream effects of these variants. Although several SCNAs are known to change gene expression levels, it is not clear whether each individual SCNA affects gene expression. We reanalyzed 77,840 expression profiles and observed a limited set of 'transcriptional components' that describe well-known biology, explain the vast majority of variation in gene expression and enable us to predict the biological function of genes. On correcting expression profiles for these components, we observed that the residual expression levels (in 'functional genomic mRNA' profiling) correlated strongly with copy number. DNA copy number correlated positively with expression levels for 99% of all abundantly expressed human genes, indicating global gene dosage sensitivity. By applying this method to 16,172 patient-derived tumor samples, we replicated many loci with aberrant copy numbers and identified recurrently disrupted genes in genomically unstable cancers.


Subject(s)
DNA Copy Number Variations , Gene Dosage , Gene Expression Regulation, Neoplastic/genetics , Genomics , Neoplasms/genetics , Transcriptome , Comparative Genomic Hybridization , Gene Expression Profiling , Gene Regulatory Networks , Genetic Loci , Humans , RNA, Messenger/genetics , RNA, Neoplasm/genetics
15.
J Biomed Semantics ; 5(Suppl 1 Proceedings of the Bio-Ontologies Spec Interest G): S6, 2014.
Article in English | MEDLINE | ID: mdl-25093075

ABSTRACT

BACKGROUND: Matching and comparing sequence annotations of different reference sequences is vital to genomics research, yet many annotation formats do not specify the reference sequence types or versions used. This makes the integration of annotations from different sources difficult and error-prone. RESULTS: As part of our effort to create linked data for interoperable sequence annotations, we present an RDF data model for sequence annotation using the ontological framework established by the OBO Foundry ontologies and the Basic Formal Ontology (BFO). We defined reference sequences as the common domain of integration for sequence annotations, and identified three semantic relationships between sequence annotations. In doing so, we created the Reference Sequence Annotation to compensate for gaps in the Sequence Ontology (SO) and in its mapping to BFO, particularly for annotations that refer to versions of consensus reference sequences. Moreover, we present three integration models for sequence annotations using different reference assemblies. CONCLUSIONS: We demonstrated a working example of a sequence annotation instance, and how this instance can be linked to other annotations on different reference sequences. Sequence annotations in this format are semantically rich and can be integrated easily with different assemblies. We also identify other challenges of modeling reference sequences with the BFO.
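The core idea, tying each annotation to a versioned reference sequence so annotations on different assemblies can be told apart, can be sketched as follows. The identifiers and property names are illustrative, not the paper's actual RDF model:

```python
# Hedged sketch: a sequence annotation stated as triples that name the
# versioned reference sequence it applies to. Property names (ex:*) are
# invented; the accession is a versioned RefSeq-style identifier.
annotation = [
    ("ann:var123", "rdf:type", "so:sequence_variant"),
    ("ann:var123", "ex:onReference", "refseq:NC_000017.11"),
    ("ann:var123", "ex:start", 43044295),
]

def references_assembly(triples, ref_iri):
    """True if the annotation explicitly names the given reference version."""
    return any(p == "ex:onReference" and o == ref_iri
               for _s, p, o in triples)
```

Because the reference version is part of the data rather than implied by a file format, two annotations can be integrated only when their stated references actually match.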

16.
PLoS One ; 8(11): e78665, 2013.
Article in English | MEDLINE | ID: mdl-24260124

ABSTRACT

MOTIVATION: Weighted semantic networks built from text-mined literature can be used to retrieve known protein-protein or gene-disease associations, and have been shown to anticipate associations years before they are explicitly stated in the literature. Our text-mining system recognizes over 640,000 biomedical concepts: some are specific (e.g., names of genes or proteins), others generic (e.g., 'Homo sapiens'). Generic concepts may play important roles in automated information retrieval, extraction, and inference, but may also result in concept overload and confound retrieval and reasoning with low-relevance or even spurious links. Here, we attempted to optimize the retrieval performance for protein-protein interactions (PPI) by filtering generic concepts (node filtering) or links to generic concepts (edge filtering) from a weighted semantic network. First, we defined metrics based on network properties that quantify the specificity of concepts. Then, using these metrics, we systematically filtered generic information from the network while monitoring retrieval performance of known protein-protein interactions. We also systematically filtered specific information from the network (inverse filtering), and assessed the retrieval performance of networks composed of generic information alone. RESULTS: Filtering generic or specific information induced a two-phase response in retrieval performance: initially the effects of filtering were minimal, but beyond a critical threshold network performance suddenly dropped. Contrary to expectations, networks composed exclusively of generic information demonstrated retrieval performance comparable to unfiltered networks that also contain specific concepts. Furthermore, an analysis using individual generic concepts demonstrated that they can effectively support the retrieval of known protein-protein interactions. For instance, the concept "binding" is indicative of PPI retrieval, and the concept "mutation abnormality" is indicative of gene-disease associations. CONCLUSION: Generic concepts are important for information retrieval and cannot be removed from semantic networks without negative impact on retrieval performance.
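Node filtering can be sketched with toy data. The paper defines its own specificity metrics; here plain node degree stands in as the specificity measure, and the network is invented for illustration:

```python
# Illustrative sketch of "node filtering": treat very highly connected
# concepts as generic and drop every edge that touches them.
from collections import defaultdict

def node_filter(edges, max_degree):
    """Keep only edges between concepts whose degree is <= max_degree."""
    degree = defaultdict(int)
    for a, b, _w in edges:
        degree[a] += 1
        degree[b] += 1
    generic = {n for n, d in degree.items() if d > max_degree}
    return [e for e in edges if e[0] not in generic and e[1] not in generic]

# Toy weighted network; "binding" is the generic, high-degree concept.
edges = [
    ("p53", "binding", 1.0),
    ("mdm2", "binding", 1.0),
    ("p53", "mdm2", 0.5),
    ("binding", "apoptosis", 0.2),
]
filtered = node_filter(edges, max_degree=2)
```

Edge filtering would instead keep the generic node but drop its links selectively; the paper's finding is that either kind of pruning, pushed past a critical threshold, degrades retrieval.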


Subject(s)
Data Mining/methods , Semantics , Vocabulary, Controlled , Humans
18.
Biotechnol J ; 8(2): 221-7, 2013 Feb.
Article in English | MEDLINE | ID: mdl-22965937

ABSTRACT

There is a growing need for sensitive and reliable nucleic acid detection methods that are convenient and inexpensive. Responsive and programmable DNA nanostructures have shown great promise as chemical detection systems. Here, we describe a DNA detection system employing the triggered self-assembly of a novel DNA dendritic nanostructure. The detection protocol is executed autonomously, without external intervention. Detection begins when a specific, single-stranded target DNA strand (T) triggers a hybridization chain reaction (HCR) between two distinct DNA hairpins (α and β). Each hairpin opens and hybridizes up to two copies of the other. In the absence of T, α and β are stable and remain in their poised, closed-hairpin form. In the presence of T, α hairpins are opened by toehold-mediated strand displacement, each of which then opens and hybridizes two β hairpins. Likewise, each opened β hairpin can open and hybridize two α hairpins. Hence, each layer of the growing dendritic nanostructure can in principle accommodate an exponentially increasing number of cognate molecules, generating a high-molecular-weight nanostructure. This HCR system has minimal sequence constraints, allowing reconfiguration for the detection of arbitrary target sequences. Here, we demonstrate detection of unique sequence identifiers of the HIV and Chlamydia pathogens.
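The exponential claim is simple arithmetic and can be stated as an idealized upper bound (not a kinetic model of the actual reaction):

```python
# Idealized capacity of the dendritic structure: each opened hairpin can
# recruit up to two hairpins of the other type, so layer n can accommodate
# up to 2**n cognate molecules.
def max_hairpins_per_layer(n_layers: int) -> list:
    # Layer 0 is the first hairpin opened by the target strand.
    return [2 ** n for n in range(n_layers + 1)]
```

This doubling per layer is what turns a single target-binding event into a high-molecular-weight, and hence detectable, assembly.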


Subject(s)
DNA/chemistry , DNA/isolation & purification , Gold/chemistry , Metal Nanoparticles/chemistry , Biosensing Techniques/instrumentation , Biosensing Techniques/methods , Chlamydia/isolation & purification , Electrophoresis, Polyacrylamide Gel , HIV/isolation & purification , Nucleic Acid Conformation , Nucleic Acid Hybridization , Sequence Analysis, DNA/methods
19.
Hum Mutat ; 33(11): 1503-12, 2012 Nov.
Article in English | MEDLINE | ID: mdl-22736453

ABSTRACT

The advances in bioinformatics required to annotate human genomic variants and to place them in public data repositories have not kept pace with their discovery. Moreover, a law of diminishing returns has begun to operate in terms of both data publication and data submission. Although the continued deposition of such data in the public domain is essential to maximize both their scientific and clinical utility, rewards for data sharing are few, representing a serious practical impediment to data submission. To date, two main strategies have been adopted as a means to encourage the submission of human genomic variant data: (1) database-journal linkups, involving the affiliation of a scientific journal with a publicly available database, and (2) microattribution, involving the unambiguous linkage of data to their contributors via a unique identifier. The latter could in principle lead to the establishment of a microcitation-tracking system that acknowledges individual endeavor and achievement. Both approaches could incentivize potential data contributors, thereby encouraging them to share their data with the scientific community. Here, we summarize and critically evaluate approaches that have been proposed to address current deficiencies in data attribution and discuss ways in which they could become more widely adopted as novel scientific publication modalities.


Subject(s)
Genetic Variation , Genome, Human , Publishing , Computational Biology , Data Collection , Databases, Genetic , Humans , Peer Review, Research
20.
Anal Biochem ; 421(2): 622-31, 2012 Feb 15.
Article in English | MEDLINE | ID: mdl-22178910

ABSTRACT

Phage display screenings are frequently employed to identify high-affinity peptides or antibodies. Although successful, phage display is a laborious technology and is notorious for yielding false-positive hits. To accelerate and improve the selection process, we employed Illumina next-generation sequencing to deeply characterize the Ph.D.-7 M13 peptide phage display library before and after several rounds of biopanning on KS483 osteoblast cells. Sequencing of the naive library after one round of amplification in bacteria identifies propagation advantage as an important source of false-positive hits. Most importantly, our data show that deep sequencing of the phage pool after a first round of biopanning is already sufficient to identify positive phages. Whereas traditional sequencing of a limited number of clones after one or two rounds of selection is uninformative, the additional rounds of biopanning it requires carry the risk of losing promising clones that propagate more slowly than nonbinding phages. Confocal and live-cell imaging confirms that our screen successfully selected a peptide with very high binding and uptake in osteoblasts. We conclude that next-generation sequencing can significantly empower phage display screenings by accelerating the identification of specific binders and restraining the number of false-positive hits.


Subject(s)
Bacteriophage M13/genetics , High-Throughput Nucleotide Sequencing/methods , Peptide Library , Animals , Cell Line , Mice