1.
iScience ; 27(3): 109309, 2024 Mar 15.
Article in English | MEDLINE | ID: mdl-38482491

ABSTRACT

Experimental analysis of functionally related genes is key to understanding biological phenomena. The selection of genes to study is a crucial and challenging step, as it requires extensive knowledge of the literature and diverse biomedical data resources. Although software tools that predict relationships between genes are available to accelerate this process, they do not directly incorporate experiment information derived from the literature. Here, we develop LEXAS, a target gene suggestion system for molecular biology experiments. LEXAS is based on machine learning models trained with diverse information sources, including 24 million experiment descriptions extracted from full-text articles in PubMed Central by using a deep-learning-based natural language processing model. By integrating the extracted experiment contexts with biomedical data sources, LEXAS suggests potential target genes for upcoming experiments, complementing existing tools like STRING, FunCoup, and GOSemSim. A simple web interface enables biologists to consider newly derived gene information while planning experiments.

2.
Brain Nerve ; 71(1): 45-55, 2019 Jan.
Article in Japanese | MEDLINE | ID: mdl-30630129

ABSTRACT

The field of natural language processing (NLP) has seen rapid advances in the past several years since the introduction of deep learning techniques. A variety of NLP tasks including syntactic parsing, machine translation, and summarization can now be performed by relatively simple combinations of general neural network models such as recurrent neural networks and attention mechanisms. This manuscript gives a brief introduction to deep learning and an overview of the current deep learning-based NLP technology.


Subjects
Deep Learning , Natural Language Processing , Computer Neural Networks
3.
J Biomed Inform ; 56: 94-102, 2015 Aug.
Article in English | MEDLINE | ID: mdl-26004792

ABSTRACT

Many text mining applications in the biomedical domain benefit from automatic clustering of relational phrases into synonymous groups, since it alleviates the problem of spurious mismatches caused by the diversity of natural language expressions. Most of the previous work that has addressed this task of synonymy resolution uses similarity metrics between relational phrases based on textual strings or dependency paths, which, for the most part, ignore the context around the relations. To overcome this shortcoming, we employ a word embedding technique to encode relational phrases. We then apply the k-means algorithm on top of the distributional representations to cluster the phrases. Our experimental results show that this approach outperforms state-of-the-art statistical models including latent Dirichlet allocation and Markov logic networks.
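
As a rough illustration of the clustering step described in this abstract, the sketch below runs plain k-means (Lloyd's algorithm) over phrase vectors. The two-dimensional vectors and the four phrases are invented toy values, not embeddings from the paper, and the starting centroids are fixed by hand so the run is deterministic.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means (Lloyd's algorithm) on small dense vectors."""
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment


# Hypothetical 2-d "embeddings" for four relational phrases (toy values).
phrases = ["activates", "stimulates", "inhibits", "suppresses"]
vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]

# Seed one centroid on "activates" and one on "inhibits".
labels = kmeans(vectors, centroids=[list(vectors[0]), list(vectors[2])])

clusters = {}
for phrase, label in zip(phrases, labels):
    clusters.setdefault(label, []).append(phrase)
print(clusters)
```

With real embeddings the vectors would have hundreds of dimensions and the centroids would normally be initialized randomly or with a seeding heuristic; the loop itself stays the same.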


Subjects
Data Mining/methods , Natural Language Processing , Controlled Vocabulary , Algorithms , Cluster Analysis , Factual Databases , False Positive Reactions , Fuzzy Logic , MEDLINE , Markov Chains , Medical Informatics/methods , Statistical Models , Probability , Reproducibility of Results , Semantics
4.
BMC Bioinformatics ; 16: 107, 2015 Apr 01.
Article in English | MEDLINE | ID: mdl-25887686

ABSTRACT

BACKGROUND: Relation extraction is a fundamental technology in biomedical text mining. Most of the previous studies on relation extraction from biomedical literature have focused on specific or predefined types of relations, which inherently limits the types of the extracted relations. With the aim of fully leveraging the knowledge described in the literature, we address much broader types of semantic relations using a single extraction framework. RESULTS: Our system, which we name PASMED, extracts diverse types of binary relations from biomedical literature using deep syntactic patterns. Our experimental results demonstrate that it achieves a level of recall considerably higher than the state of the art, while maintaining reasonable precision. We have then applied PASMED to the whole MEDLINE corpus and extracted more than 137 million semantic relations. The extracted relations provide a quantitative understanding of what kinds of semantic relations are actually described in MEDLINE and can be ultimately extracted by (possibly type-specific) relation extraction systems. CONCLUSION: PASMED extracts a large number of relations that have previously been missed by existing text mining systems. The entire collection of the relations extracted from MEDLINE is publicly available in machine-readable form, so that it can serve as a potential knowledge base for high-level text-mining applications.


Subjects
Data Mining/methods , MEDLINE , Semantics
5.
Bioinformatics ; 27(13): i111-9, 2011 Jul 01.
Article in English | MEDLINE | ID: mdl-21685059

ABSTRACT

MOTIVATION: Discovering useful associations between biomedical concepts has been one of the main goals in biomedical text-mining, and understanding their biomedical contexts is crucial in the discovery process. Hence, we need a text-mining system that helps users explore various types of (possibly hidden) associations in an easy and comprehensible manner. RESULTS: This article describes FACTA+, a real-time text-mining system for finding and visualizing indirect associations between biomedical concepts from MEDLINE abstracts. The system can be used as a text search engine like PubMed with additional features to help users discover and visualize indirect associations between important biomedical concepts such as genes, diseases and chemical compounds. FACTA+ inherits all functionality from its predecessor, FACTA, and extends it by incorporating three new features: (i) detecting biomolecular events in text using a machine learning model, (ii) discovering hidden associations using co-occurrence statistics between concepts, and (iii) visualizing associations to improve the interpretability of the output. To the best of our knowledge, FACTA+ is the first real-time web application that offers the functionality of finding concepts involving biomolecular events and visualizing indirect associations of concepts with both their categories and importance. AVAILABILITY: FACTA+ is available as a web application at http://refine1-nactem.mc.man.ac.uk/facta/, and its visualizer is available at http://refine1-nactem.mc.man.ac.uk/facta-visualizer/. CONTACT: tsuruoka@jaist.ac.jp.


Subjects
Artificial Intelligence , Data Mining , Medical Informatics Applications , Internet , MEDLINE , PubMed , United States
6.
Bioinformatics ; 27(8): 1185-6, 2011 Apr 15.
Article in English | MEDLINE | ID: mdl-21349873

ABSTRACT

Often, the most informative genes have to be selected from different gene sets, and several computational gene ranking algorithms have been developed to cope with the problem. To help researchers decide which algorithm to use, we developed the analysis of gene ranking algorithms (AGRA) system, which offers a novel technique for comparing ranked lists of genes. The most important feature of AGRA is that no previous knowledge of gene ranking algorithms is needed for their comparison. Using the text mining system FACTA (Finding Associated Concepts with Text Analysis), AGRA defines what we call a biomedical concept space (BCS) for each gene list and offers a comparison of the gene lists in six different BCS categories. The uploaded gene lists can be compared using two different methods. In the first method, the overlap between the BCSs of each pair of gene lists is calculated. The second method offers a text field where a specific biomedical concept can be entered. AGRA searches for this concept in each gene list's BCS, highlights the rank of the concept and offers a visual representation of concepts ranked above and below it. AVAILABILITY AND IMPLEMENTATION: Available at http://agra.fzv.uni-mb.si/, implemented in Java and running on the Glassfish server. CONTACT: simon.kocbek@uni-mb.si.


Subjects
Algorithms , Genes , Data Mining , Software
7.
BMC Syst Biol ; 4: 114, 2010 Aug 16.
Article in English | MEDLINE | ID: mdl-20712863

ABSTRACT

BACKGROUND: Genome-scale metabolic reconstructions have been recognised as a valuable tool for a variety of applications ranging from metabolic engineering to evolutionary studies. However, the reconstruction of such networks remains an arduous process requiring a high level of human intervention. This process is further complicated by occurrences of missing or conflicting information and the absence of common annotation standards between different data sources. RESULTS: In this article, we report a semi-automated methodology aimed at streamlining the process of metabolic network reconstruction by enabling the integration of different genome-wide databases of metabolic reactions. We present results obtained by applying this methodology to the metabolic network of the plant Arabidopsis thaliana. A systematic comparison of compounds and reactions between two genome-wide databases allowed us to obtain a high-quality core consensus reconstruction, which was validated for stoichiometric consistency. A lower level of consensus led to a larger reconstruction, which has a lower quality standard but provides a baseline for further manual curation. CONCLUSION: This semi-automated methodology may be applied to other organisms and help to streamline the process of genome-scale network reconstruction in order to accelerate the transfer of such models to applications.


Subjects
Factual Databases , Genomics/methods , Metabolic Networks and Pathways , Arabidopsis/genetics , Arabidopsis/metabolism , Genome , Reproducibility of Results
8.
Bioinformatics ; 26(12): i374-81, 2010 Jun 15.
Article in English | MEDLINE | ID: mdl-20529930

ABSTRACT

MOTIVATION: Metabolic and signaling pathways are an increasingly important part of organizing knowledge in systems biology. They serve to integrate collective interpretations of facts scattered throughout literature. Biologists construct a pathway by reading a large number of articles and interpreting them as a consistent network, but most of the models constructed currently lack direct links to those articles. Biologists who want to check the original articles have to spend substantial amounts of time to collect relevant articles and identify the sections relevant to the pathway. Furthermore, with the scientific literature expanding by several thousand papers per week, keeping a model relevant requires a continuous curation effort. In this article, we present a system designed to integrate a pathway visualizer, text mining systems and annotation tools into a seamless environment. This will enable biologists to freely move between parts of a pathway and relevant sections of articles, as well as identify relevant papers from large text bases. The system, PathText, is developed by Systems Biology Institute, Okinawa Institute of Science and Technology, National Centre for Text Mining (University of Manchester) and the University of Tokyo, and is being used by groups of biologists from these locations.


Subjects
Data Mining/methods , Software , Biological Phenomena , Systems Biology
9.
BMC Bioinformatics ; 9 Suppl 11: S5, 2008 Nov 19.
Article in English | MEDLINE | ID: mdl-19025691

ABSTRACT

BACKGROUND: When term ambiguity and variability are very high, dictionary-based Named Entity Recognition (NER) is not an ideal solution even though large-scale terminological resources are available. Many studies on statistical NER have tried to cope with these problems. However, it is not obvious how to exploit existing and additional Named Entity (NE) dictionaries in statistical NER. Presumably, adding NEs to an NE dictionary should lead to better performance. In reality, however, the NER models must be retrained to achieve this. We chose protein name recognition as a case study because it suffers most from the problems of heavy term variation and ambiguity. METHODS: We have established a novel way to improve NER performance by adding NEs to an NE dictionary without retraining. In our approach, known NEs are first identified in parallel with Part-of-Speech (POS) tagging based on a general word dictionary and an NE dictionary. Then, statistical NER is trained on the POS/PROTEIN tagger outputs with correct NE labels attached. RESULTS: We evaluated the performance of our NER system on the standard JNLPBA-2004 data set. The F-score on the test set improved from 73.14 to 73.78 after adding protein names appearing in the training data to the POS tagger dictionary without any model retraining. The performance further increased to 78.72 after enriching the tagging dictionary with test set protein names. CONCLUSION: Our approach has demonstrated high performance in protein name recognition, which indicates how to make the most of known NEs in statistical NER.


Subjects
Computational Biology/methods , Dictionaries as Topic , Automated Pattern Recognition/methods , Statistical Models , Online Systems , Proteomics/methods , Terminology as Topic
10.
BMC Bioinformatics ; 9 Suppl 11: S8, 2008 Nov 19.
Article in English | MEDLINE | ID: mdl-19025694

ABSTRACT

BACKGROUND: Previous studies of named entity recognition have shown that a reasonable level of recognition accuracy can be achieved by using machine learning models such as conditional random fields or support vector machines. However, the lack of training data (i.e. annotated corpora) makes it difficult for machine learning-based named entity recognizers to be used in building practical information extraction systems. RESULTS: This paper presents an active learning-like framework for reducing the human effort required to create named entity annotations in a corpus. In this framework, the annotation work is performed as an iterative and interactive process between the human annotator and a probabilistic named entity tagger. Unlike active learning, our framework aims to annotate all occurrences of the target named entities in the given corpus, so that the resulting annotations are free from the sampling bias which is inevitable in active learning approaches. CONCLUSION: We evaluate our framework by simulating the annotation process using two named entity corpora and show that our approach can reduce the number of sentences which need to be examined by the human annotator. The cost reduction achieved by the framework could be drastic when the target named entities are sparse.


Subjects
Information Storage and Retrieval/methods , Automated Pattern Recognition/methods , Terminology as Topic , Algorithms , Artificial Intelligence , Bibliographic Databases , Natural Language Processing
11.
Bioinformatics ; 24(21): 2559-60, 2008 Nov 01.
Article in English | MEDLINE | ID: mdl-18772154

ABSTRACT

FACTA is a text search engine for MEDLINE abstracts, which is designed particularly to help users browse biomedical concepts (e.g. genes/proteins, diseases, enzymes and chemical compounds) appearing in the documents retrieved by the query. The concepts are presented to the user in a tabular format and ranked based on co-occurrence statistics. Unlike existing systems that provide similar functionality, FACTA pre-indexes not only the words but also the concepts mentioned in the documents, which enables the user to issue a flexible query (e.g. free keywords or Boolean combinations of keywords/concepts) and receive the results immediately even when the number of documents that match the query is very large. The user can also view snippets from MEDLINE to get textual evidence of associations between the query terms and the concepts. The concept IDs and their names/synonyms for building the indexes were collected from several biomedical databases and thesauri, such as UniProt, BioThesaurus, UMLS, KEGG and DrugBank. AVAILABILITY: The system is available at http://www.nactem.ac.uk/software/facta/
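
The pre-indexing idea in this abstract can be sketched roughly as follows. The corpus, the concept IDs, and the use of raw co-occurrence counts as the ranking score are all assumptions made for illustration; the abstract does not specify the actual statistic or data layout used by FACTA.

```python
from collections import Counter

# Hypothetical toy corpus: each document carries its words and the
# concept IDs recognized in it (all IDs are invented for this sketch).
docs = {
    1: {"words": {"insulin", "resistance"},
        "concepts": {"GENE:INS", "DISEASE:diabetes"}},
    2: {"words": {"insulin", "signaling"},
        "concepts": {"GENE:INS", "GENE:IRS1"}},
    3: {"words": {"tumor", "growth"},
        "concepts": {"GENE:TP53"}},
}

# Pre-index: map every word to the set of documents containing it,
# so a query never has to scan the corpus.
word_index = {}
for doc_id, doc in docs.items():
    for word in doc["words"]:
        word_index.setdefault(word, set()).add(doc_id)


def rank_concepts(query_word):
    """Rank concepts by how many matching documents mention them."""
    counts = Counter()
    for doc_id in word_index.get(query_word, set()):
        counts.update(docs[doc_id]["concepts"])
    # Highest count first; ties broken alphabetically for stable output.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


ranking = rank_concepts("insulin")
print(ranking)
```

A production system would likely index concepts the same way as words so that Boolean combinations of both can be answered by set intersection.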


Subjects
Abstracting and Indexing/methods , MEDLINE , Software , Database Management Systems
12.
BMC Bioinformatics ; 9 Suppl 3: S2, 2008 Apr 11.
Article in English | MEDLINE | ID: mdl-18426547

ABSTRACT

BACKGROUND: One of the difficulties in mapping biomedical named entities, e.g. genes, proteins, chemicals and diseases, to their concept identifiers stems from the potential variability of the terms. Soft string matching is a possible solution to the problem, but its inherent heavy computational cost discourages its use when the dictionaries are large or when real time processing is required. A less computationally demanding approach is to normalize the terms by using heuristic rules, which enables us to look up a dictionary in a constant time regardless of its size. The development of good heuristic rules, however, requires extensive knowledge of the terminology in question and thus is the bottleneck of the normalization approach. RESULTS: We present a novel framework for discovering a list of normalization rules from a dictionary in a fully automated manner. The rules are discovered in such a way that they minimize the ambiguity and variability of the terms in the dictionary. We evaluated our algorithm using two large dictionaries: a human gene/protein name dictionary built from BioThesaurus and a disease name dictionary built from UMLS. CONCLUSIONS: The experimental results showed that automatically discovered rules can perform comparably to carefully crafted heuristic rules in term mapping tasks, and the computational overhead of rule application is small enough that a very fast implementation is possible. This work will help improve the performance of term-concept mapping tasks in biomedical information extraction especially when good normalization heuristics for the target terminology are not fully known.
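
To make the normalization approach concrete, here is a minimal sketch of rule-based term normalization followed by a hash lookup. The two rules (lowercasing, removing spaces and hyphens) are hand-written examples, whereas the paper discovers its rules automatically; the dictionary entries and concept IDs are hypothetical.

```python
import re


def normalize(term):
    """Apply two example rules: lowercase, then drop spaces and hyphens."""
    return re.sub(r"[\s\-]+", "", term.lower())


# Hypothetical name-to-ID dictionary (IDs invented for this sketch).
dictionary = {"IL-2": "ID:0001", "NF-kappa B": "ID:0002"}

# Normalize every dictionary name once, at build time.
index = {normalize(name): concept_id for name, concept_id in dictionary.items()}


def lookup(term):
    """Map a surface form to its concept ID, or None if unknown."""
    return index.get(normalize(term))


print(lookup("il 2"))        # variant spelling of "IL-2"
print(lookup("NF kappaB"))   # variant spelling of "NF-kappa B"
```

Because the index is a hash table keyed on normalized strings, each lookup costs one normalization plus one hash probe, independent of dictionary size.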


Subjects
Algorithms , Artifacts , Artificial Intelligence , Dictionaries as Topic , Natural Language Processing , Automated Pattern Recognition/methods , Terminology as Topic , Controlled Vocabulary
13.
Pac Symp Biocomput ; : 616-27, 2008.
Article in English | MEDLINE | ID: mdl-18229720

ABSTRACT

Recently, several text mining programs have reached a near-practical level of performance. Some systems are already being used by biologists and database curators. However, it has also been recognized that current Natural Language Processing (NLP) and Text Mining (TM) technology is not easy to deploy, since research groups tend to develop systems that cater specifically to their own requirements. One of the major reasons for the difficulty of deployment of NLP/TM technology is that re-usability and interoperability of software tools are typically not considered during development. While some effort has been invested in making interoperable NLP/TM toolkits, the developers of end-to-end systems still often struggle to reuse NLP/TM tools, and often opt to develop similar programs from scratch instead. This is particularly the case in BioNLP, since the requirements of biologists are so diverse that NLP tools have to be adapted and re-organized in a much more extensive manner than was originally expected. Although generic frameworks like UIMA (Unstructured Information Management Architecture) provide promising ways to solve this problem, the solution that they provide is only partial. In order for truly interoperable toolkits to become a reality, we also need sharable type systems and a developer-friendly environment for software integration that includes functionality for systematic comparisons of available tools, a simple I/O interface, and visualization tools. In this paper, we describe such an environment that was developed based on UIMA, and we show its feasibility through our experience in developing a protein-protein interaction (PPI) extraction system.


Subjects
Computational Biology , Protein Interaction Mapping/statistics & numerical data , Information Storage and Retrieval , Natural Language Processing
14.
Bioinformatics ; 23(20): 2768-74, 2007 Oct 15.
Article in English | MEDLINE | ID: mdl-17698493

ABSTRACT

MOTIVATION: One of the bottlenecks of biomedical data integration is variation of terms. Exact string matching often fails to associate a name with its biological concept, i.e. ID or accession number in the database, due to seemingly small differences of names. Soft string matching potentially enables us to find the relevant ID by considering the similarity between the names. However, the accuracy of soft matching highly depends on the similarity measure employed. RESULTS: We used logistic regression for learning a string similarity measure from a dictionary. Experiments using several large-scale gene/protein name dictionaries showed that the logistic regression-based similarity measure outperforms existing similarity measures in dictionary look-up tasks. AVAILABILITY: A dictionary look-up system using the similarity measures described in this article is available at http://text0.mib.man.ac.uk/software/mldic/.
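
Soft dictionary look-up of the kind described here can be sketched as below. Note the substitution: the paper learns its similarity measure with logistic regression, while this sketch uses a fixed character-bigram Dice coefficient as a stand-in. The dictionary entries, IDs, and the 0.5 threshold are hypothetical.

```python
def bigrams(text):
    """Set of character bigrams of the lowercased string."""
    text = text.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def dice(a, b):
    """Dice coefficient over character bigrams, in the range [0, 1]."""
    x, y = bigrams(a), bigrams(b)
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def soft_lookup(name, dictionary, threshold=0.5):
    """Return the ID of the most similar entry, or None below the threshold."""
    best = max(dictionary, key=lambda entry: dice(name, entry))
    return dictionary[best] if dice(name, best) >= threshold else None


# Hypothetical dictionary (IDs invented for this sketch).
dictionary = {"interleukin 2": "ID:A", "tumor necrosis factor": "ID:B"}

match = soft_lookup("interleukin-2", dictionary)
print(match)
```

Exact matching would miss "interleukin-2" because of the hyphen; the bigram overlap still ranks the intended entry first. Replacing `dice` with a learned scoring function would not change the look-up loop.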


Subjects
Artificial Intelligence , Protein Databases , Genes , Information Storage and Retrieval/methods , Natural Language Processing , Proteins/classification , Terminology as Topic , Logistic Models , Regression Analysis
15.
BMC Bioinformatics ; 7 Suppl 3: S4, 2006 Nov 24.
Article in English | MEDLINE | ID: mdl-17134477

ABSTRACT

BACKGROUND: Automatic recognition of relations between a specific disease term and its relevant gene or protein terms is an important task in bioinformatics. Considering the utility of the results of this approach, we identified prostate cancer and gene terms with the ID tags of public biomedical databases. Moreover, considering that genetics experts will use our results, we classified them based on six topics that can be used to analyze the type of prostate cancers, genes, and their relations. METHODS: We developed a maximum entropy-based named entity recognizer and a relation recognizer and applied them to a corpus-based approach. We collected prostate cancer-related abstracts from MEDLINE, and biologists constructed an annotated corpus of gene and prostate cancer relations based on six topics. We used it to train the maximum entropy-based named entity recognizer and relation recognizer. RESULTS: Topic-classified relation recognition achieved 92.1% precision for the relation (an increase of 11.0% from that obtained in a baseline experiment). For all topics, the precision was between 67.6 and 88.1%. CONCLUSION: A series of experimental results revealed two important findings: a carefully designed relation recognition system using named entity recognition can improve the performance of relation recognition, and topic-classified relation recognition can be effectively addressed through a corpus-based approach using manual annotation and machine learning techniques.


Subjects
Abstracting and Indexing/methods , Artificial Intelligence , Information Storage and Retrieval/methods , MEDLINE , Natural Language Processing , Neoplasm Proteins/classification , Prostatic Neoplasms/classification , Algorithms , Factual Databases , Genes/genetics , Humans , Male , Neoplasm Proteins/genetics , Periodicals as Topic , Prostatic Neoplasms/genetics , Semantics , Software , Terminology as Topic , Controlled Vocabulary
16.
Pac Symp Biocomput ; : 4-15, 2006.
Article in English | MEDLINE | ID: mdl-17094223

ABSTRACT

We describe a system that extracts disease-gene relations from Medline. We constructed a dictionary for disease and gene names from six public databases and extracted relation candidates by dictionary matching. Since dictionary matching produces a large number of false positives, we developed a method of machine learning-based named entity recognition (NER) to filter out false recognitions of disease/gene names. We found that the performance of relation extraction is heavily dependent upon the performance of NER filtering and that the filtering improves the precision of relation extraction by 26.7% at the cost of a small reduction in recall.


Subjects
Artificial Intelligence , Disease , Genes , MEDLINE , Animals , Computing Methodologies , Medical Dictionaries as Topic , Humans , Terminology as Topic , Unified Medical Language System
17.
J Biomed Inform ; 37(6): 461-70, 2004 Dec.
Article in English | MEDLINE | ID: mdl-15542019

ABSTRACT

Dictionary-based protein name recognition is often a first step in extracting information from biomedical documents because it can provide ID information on recognized terms. However, dictionary-based approaches present two fundamental difficulties: (1) false recognition mainly caused by short names; (2) low recall due to spelling variations. In this paper, we tackle the former problem using machine learning to filter out false positives and present two alternative methods for alleviating the latter problem of spelling variations. The first uses approximate string searching, and the second expands the dictionary with a probabilistic variant generator, which we propose in this paper. Experimental results using the GENIA corpus revealed that filtering using a naive Bayes classifier greatly improved precision with only a slight loss of recall, resulting in a 10.8% improvement in F-measure, and dictionary expansion with the variant generator gave a further 1.6% improvement and achieved an F-measure of 66.6%.


Subjects
Abstracting and Indexing/methods , Computational Biology/methods , Information Storage and Retrieval/methods , Proteins/chemistry , Algorithms , Animals , Artificial Intelligence , Bayes Theorem , Bibliographic Databases , Protein Databases , Dictionaries as Topic , Humans , Names , Probability , Software
18.
In Silico Biol ; 4(1): 31-54, 2004.
Article in English | MEDLINE | ID: mdl-15089752

ABSTRACT

As a first step toward the quantitative comparison of clinical features of diseases, we indexed the text descriptions in the Clinical Synopsis section of the Online Mendelian Inheritance in Man (OMIM) with concepts for the body parts, organs, and tissues contained in the Metathesaurus of the Unified Medical Language System (UMLS). We also indexed the text with the diseases and disorders having links to body parts specified in the thesaurus. The vocabulary size was approximately 177,540 representations for 81,435 concepts, and 2,161 concepts were indexed to 3,779 OMIM entries. The indexed concepts included 134 concepts for the noun forms of anatomical concepts and 985 indexed concepts for diseases and disorders that were linked to 132 and 408 anatomical concepts, respectively. We report herein that the retrieval of OMIM entries for diseases affecting specific organs can be made more comprehensive through the anatomical concepts indexed to the Clinical Synopsis or linked to the indexed concepts, as compared to simply matching organ names to the Clinical Synopsis text. The recall and precision of identifying relevant body parts in the Clinical Synopsis were calculated as 78% and 92.5%, respectively, based on random sampling. The examination of the unidentified body parts due to lack of indexed diseases and disorders showed that although most of the concepts for diseases and disorders were contained in the Metathesaurus, their relations to body parts were not. The indexing result proved the effectiveness of the Metathesaurus as a resource for the identification of concepts indicating body parts, diseases, and disorders.
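
The abstract reports recall and precision separately. For readers who want a single combined figure, the sketch below computes the balanced F-measure (harmonic mean) from those two values; the resulting number is an illustrative calculation and is not reported in the paper.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision 92.5% and recall 78% as stated in the abstract.
f1 = f_measure(precision=0.925, recall=0.78)
print(round(f1, 3))
```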


Subjects
Abstracting and Indexing/methods , Anatomy , Genetic Databases , Diagnosis , Unified Medical Language System , Humans , Terminology as Topic