1.
Bioinformatics ; 37(19): 3343-3348, 2021 Oct 11.
Article in English | MEDLINE | ID: mdl-33964129

ABSTRACT

MOTIVATION: Gene Ontology Causal Activity Models (GO-CAMs) assemble individual associations of gene products with cellular components, molecular functions and biological processes into causally linked activity flow models. Pathway databases such as the Reactome Knowledgebase create detailed molecular process descriptions of reactions and assemble them, based on entities shared between individual reactions, into pathway descriptions. RESULTS: To convert the rich content of Reactome into GO-CAMs, we developed Pathways2GO, a software tool that converts the entire set of normal human Reactome pathways into GO-CAMs. This conversion yields standard GO annotations from Reactome content and supports enhanced quality control for both Reactome and GO, yielding a nearly seamless conversion between these two resources for the bioinformatics community. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
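
A GO-CAM chains gene-product activities into a causal flow, with each activity tied to a molecular function, a cellular component and a biological process. As a rough illustration of the target data shape (not the Pathways2GO code itself, whose internals the abstract does not describe), here is a minimal Python sketch with hypothetical identifier values:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Activity:
    """One GO-CAM unit: a gene product enabling a molecular function."""
    gene_product: str                 # hypothetical UniProt CURIE
    molecular_function: str           # GO molecular function term
    occurs_in: str                    # GO cellular component term
    part_of: str                      # GO biological process term
    causally_downstream: List["Activity"] = field(default_factory=list)

# Two reactions that share an entity become two causally linked activities.
kinase = Activity("UniProt:P00519", "GO:0004713", "GO:0005829", "GO:0007169")
adaptor = Activity("UniProt:P46109", "GO:0005515", "GO:0005829", "GO:0007169")
kinase.causally_downstream.append(adaptor)
```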

2.
PLoS One ; 16(3): e0231916, 2021.
Article in English | MEDLINE | ID: mdl-33755673

ABSTRACT

AVAILABILITY: The API and associated software are open source and currently available at https://github.com/NCATS-Tangerine/translator-knowledge-beacon.


Subjects
Knowledge , Software , Databases, Factual , Internet
3.
Patterns (N Y) ; 2(1): 100155, 2021 Jan 08.
Article in English | MEDLINE | ID: mdl-33196056

ABSTRACT

Integrated, up-to-date data about SARS-CoV-2 and COVID-19 are crucial for the ongoing response to the COVID-19 pandemic by the biomedical research community. While rich biological knowledge exists for SARS-CoV-2 and related viruses (SARS-CoV, MERS-CoV), integrating this knowledge is difficult and time-consuming, since much of it sits in siloed databases or in textual format. Furthermore, the data required by the research community vary drastically across tasks; the optimal data for a machine learning task, for example, differ greatly from the data used to populate a browsable user interface for clinicians. To address these challenges, we created KG-COVID-19, a flexible framework that ingests and integrates heterogeneous biomedical data to produce knowledge graphs (KGs), and applied it to create a KG for COVID-19 response. This KG framework can also be applied to other problems in which siloed biomedical data must be quickly integrated for different research applications, including future pandemics.
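
KG-COVID-19 serializes each ingested source as node and edge tables and merges them into one graph. A minimal sketch of the merge idea, assuming simple TSV files with id/subject/predicate/object columns (the framework's actual file formats, column names and module layout may differ):

```python
import csv
import networkx as nx

def load_source(nodes_tsv: str, edges_tsv: str) -> nx.MultiDiGraph:
    """Load one source's node and edge tables into a directed multigraph."""
    g = nx.MultiDiGraph()
    with open(nodes_tsv) as f:
        for row in csv.DictReader(f, delimiter="\t"):
            g.add_node(row["id"], **row)
    with open(edges_tsv) as f:
        for row in csv.DictReader(f, delimiter="\t"):
            g.add_edge(row["subject"], row["object"], predicate=row["predicate"])
    return g

# File names are hypothetical. Nodes sharing an identifier (e.g. the same
# CURIE across sources) collapse into a single node on merge.
merged = nx.compose_all([
    load_source("drugs_nodes.tsv", "drugs_edges.tsv"),
    load_source("proteins_nodes.tsv", "proteins_edges.tsv"),
])
```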

4.
Database (Oxford) ; 2020, 2020 01 01.
Article in English | MEDLINE | ID: mdl-32283553

ABSTRACT

Hypothesis generation is a critical step in research and a cornerstone in the rare disease field. Research is most efficient when those hypotheses are based on the entirety of knowledge known to date. Systematic review articles are commonly used in biomedicine to summarize existing knowledge and contextualize experimental data. But the information contained within review articles is typically only expressed as free text, which is difficult to use computationally. Researchers struggle to navigate, collect and remix prior knowledge, as it is scattered across several silos without seamless integration and access. This lack of a structured information framework hinders research by both experimental and computational scientists. To better organize knowledge and data, we built a structured review article that is specifically focused on NGLY1 Deficiency, an ultra-rare genetic disease first reported in 2012. We represented this structured review as a knowledge graph and then stored it in a Neo4j database to simplify dissemination, querying and visualization of the network. Relative to free text, this structured review better promotes the principles of findability, accessibility, interoperability and reusability (FAIR). In collaboration with domain experts in NGLY1 Deficiency, we demonstrate how this resource can improve the efficiency and comprehensiveness of hypothesis generation. We also developed a read-write interface that allows domain experts to contribute FAIR structured knowledge to this community resource. In contrast to traditional free-text review articles, this structured review exists as a living knowledge graph that is curated by humans and accessible to computational analyses. Finally, we have generalized this workflow into modular and repurposable components that can be applied to other domain areas. This NGLY1 Deficiency-focused network is publicly available at http://ngly1graph.org/. AVAILABILITY AND IMPLEMENTATION: Database URL: http://ngly1graph.org/. Network data files: https://github.com/SuLab/ngly1-graph; source code: https://github.com/SuLab/bioknowledge-reviewer. CONTACT: asu@scripps.edu.
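
Because the review graph is served from a Neo4j database, it can be queried with Cypher. A minimal sketch using the official neo4j Python driver; the connection URI, credentials and node schema below are hypothetical, so consult ngly1graph.org for the real ones:

```python
from neo4j import GraphDatabase

# Hypothetical endpoint, credentials and property names.
driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

query = """
MATCH (g {id: 'NGLY1'})-[r]-(n)
RETURN type(r) AS relation, n.id AS neighbor
LIMIT 25
"""

# Print every relation touching the NGLY1 node.
with driver.session() as session:
    for record in session.run(query):
        print(record["relation"], record["neighbor"])
driver.close()
```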


Subjects
Biomedical Research/methods , Computational Biology/methods , Databases, Factual , Knowledge Bases , Animals , Biomedical Research/statistics & numerical data , Computational Biology/statistics & numerical data , Congenital Disorders of Glycosylation/genetics , Congenital Disorders of Glycosylation/metabolism , Data Curation/methods , Data Mining/methods , Humans , Internet , Peptide-N4-(N-acetyl-beta-glucosaminyl) Asparagine Amidase/deficiency , Peptide-N4-(N-acetyl-beta-glucosaminyl) Asparagine Amidase/genetics , Peptide-N4-(N-acetyl-beta-glucosaminyl) Asparagine Amidase/metabolism , Systematic Reviews as Topic
5.
Elife ; 9, 2020 03 17.
Article in English | MEDLINE | ID: mdl-32180547

ABSTRACT

Wikidata is a community-maintained knowledge base that has been assembled from repositories in the fields of genomics, proteomics, genetic variants, pathways, chemical compounds, and diseases, and that adheres to the FAIR principles of findability, accessibility, interoperability and reusability. Here we describe the breadth and depth of the biomedical knowledge contained within Wikidata, and discuss the open-source tools we have built to add information to Wikidata and to synchronize it with source databases. We also demonstrate several use cases for Wikidata, including the crowdsourced curation of biomedical ontologies, phenotype-based diagnosis of disease, and drug repurposing.


Subjects
Biological Science Disciplines , Computational Biology , Databases, Factual , Genomics , Proteomics , Humans , Pattern Recognition, Automated
6.
Bioinformatics ; 36(4): 1226-1233, 2020 02 15.
Article in English | MEDLINE | ID: mdl-31504205

ABSTRACT

MOTIVATION: Biomedical literature is growing at a rate that outpaces our ability to harness the knowledge contained therein. To mine valuable inferences from the large volume of literature, many researchers use information extraction algorithms to harvest information in biomedical texts. Information extraction is usually accomplished via a combination of manual expert curation and computational methods. Advances in computational methods usually depend on the time-consuming generation of gold standards by a limited number of expert curators. Citizen science is public participation in scientific research. We previously found that citizen scientists are willing and able to perform named entity recognition of disease mentions in biomedical abstracts, but did not know whether the same would hold for relationship extraction (RE). RESULTS: In this article, we introduce the Relationship Extraction Module of the web-based application Mark2Cure (M2C) and demonstrate that citizen scientists can perform RE. We confirm the importance of accurate named entity recognition for user performance of RE and identify design issues that impacted data quality. We find that the data generated by citizen scientists can be used to identify relationship types not currently available in the M2C Relationship Extraction Module. We compare the citizen science-generated data with algorithm-mined data and identify ways in which the two approaches may complement one another. We also discuss opportunities for future improvement of this system, as well as the potential synergies between citizen science, manual biocuration and natural language processing. AVAILABILITY AND IMPLEMENTATION: Mark2Cure platform: https://mark2cure.org; Mark2Cure source code: https://github.com/sulab/mark2cure; data and analysis code for this article: https://github.com/gtsueng/M2C_rel_nb. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Citizen Science , Natural Language Processing , Information Storage and Retrieval , Research Design , Software
8.
Database (Oxford) ; 2017(1), 2017 01 01.
Article in English | MEDLINE | ID: mdl-28365742

ABSTRACT

With the advancement of genome-sequencing technologies, new genomes are being sequenced daily. Although these sequences are deposited in publicly available data warehouses, their functional and genomic annotations (beyond genes, which are predicted automatically) mostly reside in the text of primary publications. Professional curators are hard at work extracting those annotations from the literature for the most studied organisms and depositing them in structured databases. However, the resources do not exist to fund the comprehensive curation of the thousands of newly sequenced organisms in this manner. Here, we describe WikiGenomes (wikigenomes.org), a web application that facilitates the consumption and curation of genomic data by the entire scientific community. WikiGenomes is based on Wikidata, an openly editable knowledge graph with the goal of aggregating published knowledge into a free and open database. WikiGenomes empowers individual genomic researchers to contribute their expertise to the curation effort and integrates that knowledge into Wikidata, enabling it to be accessed by anyone without restriction. Database URL: www.wikigenomes.org.


Subjects
Databases, Nucleic Acid , Genome , Internet , Molecular Sequence Annotation/methods , Molecular Sequence Annotation/standards
10.
Article in English | MEDLINE | ID: mdl-27307137

ABSTRACT

Drug toxicity is a major concern for both regulatory agencies and the pharmaceutical industry. In this context, text-mining methods for the identification of drug side effects from free text are key for the development of up-to-date knowledge sources on drug adverse reactions. We present a new system for identification of drug side effects from the literature that combines three approaches: machine learning, rule-based methods and knowledge-based methods. This system was developed to address Task 3.B of the BioCreative V challenge (BC5), dealing with Chemical-induced Disease (CID) relations. The first two approaches identify relations at the sentence level, while the knowledge-based approach is applied at both sentence and abstract levels. The machine learning method is based on the BeFree system and uses two corpora as training data: the annotated data provided by the CID task organizers and a new CID corpus developed by crowdsourcing. Different combinations of results from the three strategies were selected for each run of the challenge. In the final evaluation setting, the system achieved the highest recall of the challenge (63%). By performing an error analysis, we identified the main causes of misclassification and areas for improvement of our system, and highlighted the need for consistent gold standard data sets for advancing the state of the art in text mining of drug side effects. Database URL: https://zenodo.org/record/29887?ln=en#.VsL3yDLWR_V.
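
The abstract does not spell out how the three strategies' outputs were combined in each run, so the following is only one plausible aggregation scheme: vote-based merging of sentence-level (chemical, disease) predictions, sketched with toy data:

```python
def combine(ml, rules, kb, min_votes=2):
    """Keep a (chemical, disease) pair if at least min_votes of the three
    strategies predicted it. min_votes=1 is the union (favors recall);
    min_votes=3 is the intersection (favors precision)."""
    candidates = ml | rules | kb
    return {pair for pair in candidates
            if sum(pair in s for s in (ml, rules, kb)) >= min_votes}

# Hypothetical sentence-level predictions from the three strategies.
ml = {("warfarin", "bleeding"), ("aspirin", "ulcer")}
rules = {("warfarin", "bleeding")}
kb = {("warfarin", "bleeding"), ("statin", "myopathy")}

print(combine(ml, rules, kb, min_votes=2))  # {('warfarin', 'bleeding')}
```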


Subjects
Chemically-Induced Disorders , Crowdsourcing , Databases, Factual/standards , Machine Learning/standards , Animals , Chemically-Induced Disorders/genetics , Chemically-Induced Disorders/metabolism , Crowdsourcing/methods , Crowdsourcing/standards , Data Mining/methods , Data Mining/standards , Humans
11.
Gene ; 592(2): 235-8, 2016 Nov 05.
Article in English | MEDLINE | ID: mdl-27150585

ABSTRACT

Wikipedia and other openly available resources are increasingly common sources of information, not just among the lay public but also in academic circles, including undergraduate students and postgraduate trainees. To enhance the quality of Wikipedia articles, in 2013 we initiated the Gene Wiki Reviews, a series of invited reviews on genes and proteins that stipulated editing the corresponding Wikipedia article(s), with those edits also subject to peer review. Thus, while the review article serves as an authoritative snapshot of the field, the "living article" can continue to evolve with the crowdsourcing model of Wikipedia. After publication of over 50 articles, we surveyed the authors to assess the impact of the project. The survey results revealed that the Gene Wiki project is achieving its major objectives: to increase the involvement of scientists in authoring Wikipedia articles and to enhance the quantity and quality of the information about genes and their protein products. The dual publication model introduced in the Gene Wiki Reviews series thus represents a valuable innovation in scientific publishing and biomedical knowledge management. We invite experts on specific genes to contact the editors to take part in this project to enhance the quality and accessibility of information about the human genome.


Subjects
Access to Information , Databases, Nucleic Acid/standards , Genome, Human , Molecular Sequence Annotation/standards , Humans
12.
Bioinformatics ; 32(13): 2072-2074, 2016 07 01.
Article in English | MEDLINE | ID: mdl-27153723

ABSTRACT

UNLABELLED: Branch is a web application that lets users interact directly with large biomedical datasets. The interaction is mediated through a collaborative graphical user interface for building and evaluating decision trees. These trees can be used to compose and test sophisticated hypotheses and to develop predictive models. Decision trees are built and evaluated against a library of imported datasets and can be stored in a collective area for sharing and reuse. AVAILABILITY AND IMPLEMENTATION: Branch is hosted at http://biobranch.org/ and the open source code is available at http://bitbucket.org/sulab/biobranch/. CONTACTS: asu@scripps.edu or bgood@scripps.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
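
Branch itself is a collaborative web GUI, but the core operation is fitting and evaluating decision trees on imported datasets. A minimal scripted analogue with scikit-learn (illustrative only, not Branch's code; a public biomedical dataset stands in for a user upload):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# A public biomedical dataset stands in for a user-imported one.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(export_text(tree, feature_names=list(X.columns)))  # inspect the rules
print("held-out accuracy:", tree.score(X_test, y_test))  # evaluate the model
```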


Subjects
Biomedical Research , Decision Trees , Software , Datasets as Topic , Humans , Internet , Models, Theoretical
13.
Article in English | MEDLINE | ID: mdl-27087308

ABSTRACT

Relations between chemicals and diseases are among the most queried biomedical interactions. Although expert manual curation is the standard method for extracting these relations from the literature, it is expensive and impractical to apply to large numbers of documents, so alternative methods are required. We describe here a crowdsourcing workflow for extracting chemical-induced disease relations from free text as part of the BioCreative V Chemical Disease Relation challenge. Five non-expert workers on the CrowdFlower platform were shown each potential chemical-induced disease relation highlighted in the original source text and asked to make binary judgments about whether the text supported the relation. Worker responses were aggregated through voting, and relations receiving four or more votes were predicted as true. On the official evaluation dataset of 500 PubMed abstracts, the crowd attained an F-score of 0.505 (precision 0.475, recall 0.540), with a maximum theoretical recall of 0.751 due to errors in named entity recognition. The total crowdsourcing cost was $1290.67 ($2.58 per abstract) and the work took a total of 7 h. A qualitative error analysis revealed that 46.66% of sampled errors were due to task limitations and gold standard errors, indicating that performance can still be improved. All code and results are publicly available at https://github.com/SuLab/crowd_cid_relex. Database URL: https://github.com/SuLab/crowd_cid_relex.
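
The aggregation rule is simple voting: each candidate relation receives five binary judgments and is predicted true at four or more positive votes. A sketch of that rule with hypothetical worker data:

```python
from collections import Counter

# Hypothetical judgments: (relation_id, vote) from five workers each.
judgments = [
    ("rel-1", True), ("rel-1", True), ("rel-1", True),
    ("rel-1", True), ("rel-1", False),
    ("rel-2", True), ("rel-2", False), ("rel-2", False),
    ("rel-2", True), ("rel-2", True),
]

# Count positive votes and keep relations with four or more.
yes_votes = Counter(rel for rel, vote in judgments if vote)
predicted_true = {rel for rel, n in yes_votes.items() if n >= 4}
print(predicted_true)  # {'rel-1'}; rel-2 has only 3 of 5 positive votes
```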


Subjects
Crowdsourcing , Data Curation/methods , Data Mining/methods , Databases, Factual , Disease/etiology , Hazardous Substances/toxicity , Humans , Workflow
14.
Article in English | MEDLINE | ID: mdl-27022157

ABSTRACT

The last 20 years of advancement in sequencing technologies have led to the sequencing of thousands of microbial genomes, creating mountains of genetic data. While efficiency in generating the data improves almost daily, establishing meaningful relationships between taxonomic and genetic entities at this scale requires a structured and integrative approach. Currently, knowledge is distributed across a fragmented landscape of resources, from government-funded institutions such as the National Center for Biotechnology Information (NCBI) and UniProt, to topic-focused databases like the ODB3 database of prokaryotic operons, to the supplemental tables of primary publications. A major drawback of large-scale, expert-curated databases is the expense of maintaining and extending them over time. No entity apart from a major institution with stable long-term funding can consider this, and even then the scope is limited considering the magnitude of microbial data being generated daily. Wikidata is an openly editable, semantic web compatible framework for knowledge representation. It is a project of the Wikimedia Foundation and offers knowledge integration capabilities ideally suited to the challenge of representing the exploding body of information about microbial genomics. We are developing a microbe-specific data model, based on Wikidata's semantic web compatibility, which represents bacterial species and strains and the genes and gene products that define them. Currently, we have loaded 43,694 gene and 37,966 protein items for 21 species of bacteria, including the human pathogenic bacterium Chlamydia trachomatis. Using this pathogen as an example, we explore complex interactions between the pathogen, its host, associated genes, other microbes, disease and drugs using the Wikidata SPARQL endpoint. In our next phase of development, we will add another 99 bacterial genomes and their genes and gene products, totaling ∼900,000 additional entities. This aggregation of knowledge will be a platform for community-driven collaboration, allowing the networking of microbial genetic data through the sharing of knowledge by both data and domain experts.
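
Explorations like the Chlamydia trachomatis example run against the public Wikidata SPARQL endpoint. A minimal Python sketch; wdt:P703 ("found in taxon") and wd:Q7187 ("gene") are real Wikidata identifiers, but the taxon QID below is a hypothetical placeholder and the microbial data model may differ:

```python
import requests

ENDPOINT = "https://query.wikidata.org/sparql"

# TAXON_QID is a placeholder; look up the actual taxon item before use.
TAXON_QID = "Q131065"
query = f"""
SELECT ?gene ?geneLabel WHERE {{
  ?gene wdt:P31 wd:Q7187 ;
        wdt:P703 wd:{TAXON_QID} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT 10
"""

resp = requests.get(ENDPOINT, params={"query": query, "format": "json"},
                    headers={"User-Agent": "kg-example/0.1"})
for row in resp.json()["results"]["bindings"]:
    print(row["gene"]["value"], row["geneLabel"]["value"])
```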


Subjects
Data Curation , Genome, Microbial , Models, Theoretical , Female , Gene Ontology , Genes, Bacterial , Humans , Molecular Sequence Annotation , Operon/genetics , Search Engine
15.
Article in English | MEDLINE | ID: mdl-26989148

ABSTRACT

Open biological data are distributed over many resources, making them challenging to integrate, update and disseminate quickly. Wikidata is a growing, open community database which can serve this purpose and also provides tight integration with Wikipedia. To improve the state of biological data and to facilitate their management and dissemination, we imported all human and mouse genes, and all human and mouse proteins, into Wikidata. In total, 59,721 human genes and 73,355 mouse genes have been imported from NCBI, and 27,306 human proteins and 16,728 mouse proteins have been imported from the Swiss-Prot subset of UniProt. As Wikidata is open and can be edited by anybody, our corpus of imported data serves as the starting point for integration of further data by scientists, the Wikidata community and citizen scientists alike. The first use case for these data is to populate Wikipedia Gene Wiki infoboxes directly from Wikidata, enabling immediate updates of the infoboxes as soon as the data in Wikidata are modified. Although Gene Wiki pages currently exist only on the English-language Wikipedia, the multilingual nature of Wikidata allows the data we imported to be used across all 280 language editions of Wikipedia. Beyond the Gene Wiki infobox use case, a SPARQL endpoint and export functionality to several standard formats (e.g. JSON, XML) enable use of the data by scientists. In summary, we created a fully open and extensible data resource for human and mouse molecular biology and biochemistry data. This resource enriches all the Wikipedias with structured information and serves as a new linking hub for the biological semantic web. Database URL: https://www.wikidata.org/.
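
Besides SPARQL, any Wikidata item can be exported as JSON through the standard MediaWiki API. A sketch using the wbgetentities action; Q42 is Wikidata's well-known demo item (Douglas Adams), so substitute the QID of a gene or protein item:

```python
import requests

API = "https://www.wikidata.org/w/api.php"

# Fetch one item's labels and statements as JSON.
params = {
    "action": "wbgetentities",
    "ids": "Q42",
    "props": "labels|claims",
    "format": "json",
}
entity = requests.get(API, params=params).json()["entities"]["Q42"]
print(entity["labels"]["en"]["value"])            # the item's English label
print(len(entity.get("claims", {})), "statement groups on the item")
```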


Subjects
Databases, Nucleic Acid , Semantics , Animals , Humans , Mice , Models, Theoretical , Search Engine
16.
PLoS One ; 11(2): e0149621, 2016.
Article in English | MEDLINE | ID: mdl-26919047

ABSTRACT

High-throughput experimental methods such as medical sequencing and genome-wide association studies (GWAS) identify increasingly large numbers of potential relations between genetic variants and diseases. Both biological complexity (millions of potential gene-disease associations) and the accelerating rate of data production necessitate computational approaches to prioritize and rationalize potential gene-disease relations. Here, we use concept profile technology to expose from the biomedical literature both explicitly stated gene-disease relations (the explicitome) and a much larger set of implied gene-disease associations (the implicitome). Implicit relations are largely unknown to, or even unintended by, the original authors, but they vastly extend the reach of existing biomedical knowledge for the identification and interpretation of gene-disease associations. The implicitome can be used in conjunction with experimental data resources to rationalize both known and novel associations. We demonstrate the usefulness of the implicitome by rationalizing known and novel gene-disease associations, including those from GWAS. To facilitate the reuse of implicit gene-disease associations, we publish our data in compliance with the FAIR Data Publishing recommendations [https://www.force11.org/group/fairgroup] using nanopublications. An online tool (http://knowledge.bio) is available to explore established and potential gene-disease associations in the context of other biomedical relations.
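
Concept profile matching scores a candidate gene-disease pair by comparing the weighted concept vectors each term accumulates across the literature. A toy sketch of the ranking idea (the published method's weighting scheme is more involved):

```python
import numpy as np

# Toy concept profiles: weights over a shared concept vocabulary
# (hypothetical values; real profiles span thousands of concepts).
concepts = ["apoptosis", "insulin signaling", "inflammation", "dna repair"]
gene_profile = np.array([0.8, 0.0, 0.1, 0.6])
disease_profile = np.array([0.7, 0.1, 0.0, 0.5])

# Cosine similarity of the profiles scores the implicit association,
# even if the gene and disease never co-occur in a single article.
score = gene_profile @ disease_profile / (
    np.linalg.norm(gene_profile) * np.linalg.norm(disease_profile))
print(f"implicit association score: {score:.3f}")
```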


Subjects
Computational Biology/methods , Databases, Genetic , Genetic Predisposition to Disease , Genome-Wide Association Study , Humans
17.
Brief Bioinform ; 17(1): 23-32, 2016 Jan.
Article in English | MEDLINE | ID: mdl-25888696

ABSTRACT

The use of crowdsourcing to solve important but complex problems in biomedical and clinical sciences is growing and encompasses a wide variety of approaches. The crowd is diverse and includes online marketplace workers, health information seekers, science enthusiasts and domain experts. In this article, we review and highlight recent studies that use crowdsourcing to advance biomedicine. We classify these studies into two broad categories: (i) mining big data generated from a crowd (e.g. search logs) and (ii) active crowdsourcing via specific technical platforms, e.g. labor markets, wikis, scientific games and community challenges. Through describing each study in detail, we demonstrate the applicability of different methods in a variety of domains in biomedical research, including genomics, biocuration and clinical research. Furthermore, we discuss and highlight the strengths and limitations of different crowdsourcing platforms. Finally, we identify important emerging trends, opportunities and remaining challenges for future crowdsourcing research in biomedicine.


Subjects
Crowdsourcing/trends , Computational Biology/trends , Data Mining , Humans , Internet , Search Engine , Smartphone , Social Media , Video Games
18.
Citiz Sci ; 1(2), 2016.
Article in English | MEDLINE | ID: mdl-30416754

ABSTRACT

Biomedical literature represents one of the largest and fastest-growing collections of unstructured biomedical knowledge. Finding critical information buried in the literature can be challenging. To extract information from free-flowing text, researchers need to: 1. identify the entities in the text (named entity recognition), 2. apply a standardized vocabulary to these entities (normalization), and 3. identify how entities in the text are related to one another (relationship extraction). Researchers have primarily approached these information extraction tasks through manual expert curation and computational methods. We have previously demonstrated that named entity recognition (NER) tasks can be crowdsourced to a group of non-experts via the paid microtask platform Amazon Mechanical Turk (AMT), which can dramatically reduce the cost and increase the throughput of biocuration efforts. However, given the size of the biomedical literature, even information extraction via paid microtask platforms is not scalable. With our web-based application Mark2Cure (http://mark2cure.org), we demonstrate that NER tasks can also be performed by volunteer citizen scientists with high accuracy. We apply metrics from the Zooniverse Matrices of Citizen Science Success and provide the results here to serve as a basis of comparison for other citizen science projects. Further, we discuss design considerations, issues, and the application of analytics for successfully moving a crowdsourcing workflow from a paid microtask platform to a citizen science platform. To our knowledge, this study is the first application of citizen science to a natural language processing task.
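
The three numbered steps can be illustrated with a toy dictionary-based pipeline. Real biomedical NER systems are far richer; this sketch, with a hypothetical two-entry lexicon, only fixes the ideas:

```python
import re

# Hypothetical lexicon mapping surface forms to standardized concept IDs.
lexicon = {
    "ngly1 deficiency": "MONDO:0014109",
    "alacrima": "HP:0000522",
}

text = "Alacrima is a common finding in NGLY1 deficiency."

# Steps 1 and 2: recognize mentions, then normalize them to concept IDs.
pattern = "|".join(map(re.escape, lexicon))
mentions = [(m.group(), lexicon[m.group().lower()])
            for m in re.finditer(pattern, text, re.IGNORECASE)]
print(mentions)

# Step 3: a toy relationship-extraction rule based on sentence co-occurrence.
if len(mentions) == 2:
    print(f"candidate relation: {mentions[0][1]} associated_with {mentions[1][1]}")
```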

19.
Pac Symp Biocomput ; : 267-9, 2015.
Article in English | MEDLINE | ID: mdl-25592587

ABSTRACT

The following sections are included: Introduction, Session articles, Acknowledgements and References.


Subjects
Crowdsourcing/statistics & numerical data , Data Mining/statistics & numerical data , Computational Biology , Humans
20.
Pac Symp Biocomput ; : 282-93, 2015.
Article in English | MEDLINE | ID: mdl-25592589

ABSTRACT

Identifying concepts and relationships in biomedical text enables knowledge to be applied in computational analyses. Many biological natural language processing (BioNLP) projects attempt to address this challenge, but the state of the art still leaves much room for improvement. Progress in BioNLP research depends on large, annotated corpora for evaluating information extraction systems and training machine learning models. Traditionally, such corpora are created by small numbers of expert annotators, often working over extended periods of time. Recent studies have shown that workers on microtask crowdsourcing platforms such as Amazon's Mechanical Turk (AMT) can, in aggregate, generate high-quality annotations of biomedical text. Here, we investigated the use of AMT in capturing disease mentions in PubMed abstracts. We used the NCBI Disease corpus as a gold standard for refining and benchmarking our crowdsourcing protocol. After several iterations, we arrived at a protocol that reproduced the annotations of the 593 documents in the 'training set' of this gold standard with an overall F measure of 0.872 (precision 0.862, recall 0.883). The output can also be tuned to optimize for precision (max = 0.984 when recall = 0.269) or recall (max = 0.980 when precision = 0.436). Each document was completed by 15 workers, and their annotations were merged based on a simple voting method. In total, 145 workers combined to complete all 593 documents in the span of 9 days at a cost of $0.066 per abstract per worker. The quality of the annotations, as judged by the F measure, increases with the number of workers assigned to each task; however, minimal performance gains were observed beyond 8 workers per task. These results add further evidence that microtask crowdsourcing can be a valuable tool for generating well-annotated corpora in BioNLP. Data produced for this analysis are available at http://figshare.com/articles/Disease_Mention_Annotation_with_Mechanical_Turk/1126402.
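
The benchmark figures above are span-level precision, recall and F measure against the gold standard. A sketch of that computation over sets of annotated spans, with toy data (the real evaluation follows the corpus's own matching rules):

```python
def prf(predicted: set, gold: set):
    """Strict span-level precision, recall and F1 against a gold standard."""
    tp = len(predicted & gold)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Toy annotations: (document id, start offset, end offset) of disease mentions.
gold = {("doc1", 0, 8), ("doc1", 22, 31), ("doc2", 5, 17)}
crowd = {("doc1", 0, 8), ("doc2", 5, 17), ("doc2", 40, 52)}
print("precision=%.3f recall=%.3f F=%.3f" % prf(crowd, gold))
```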


Subjects
Computational Biology/methods , Crowdsourcing/methods , PubMed , Abstracting and Indexing , Adult , Benchmarking , Computational Biology/economics , Costs and Cost Analysis , Crowdsourcing/economics , Data Curation/economics , Data Curation/methods , Disease , Female , Humans , Male , Middle Aged , Natural Language Processing , Supervised Machine Learning , Unified Medical Language System , Young Adult