Results 1 - 20 of 27
1.
IEEE/ACM Trans Comput Biol Bioinform ; 19(3): 1772-1781, 2022.
Article in English | MEDLINE | ID: mdl-33306472

ABSTRACT

Over the past decade, the demand for automated protein function prediction has increased due to the volume of newly sequenced proteins. In this paper, we address the function prediction task by developing an ensemble system that automatically assigns Gene Ontology (GO) terms to a given input protein sequence. The ensemble combines the GO predictions made by random forest (RF) and neural network (NN) classifiers. Both RF and NN models rely on features derived from BLAST sequence alignments, taxonomy and protein signature analysis tools. In addition, we report on experiments with a NN model that directly analyzes the amino acid sequence as its sole input, using a convolutional layer. The Swiss-Prot database is used as the training and evaluation data. In the CAFA3 evaluation, which relies on experimental verification of the functional predictions, our submitted ensemble model demonstrates competitive performance, ranking among the top 10 best-performing systems out of over 100 submissions. In this paper, we evaluate and further improve the CAFA3-submitted system. Our machine learning models, together with the data pre-processing and feature generation tools, are publicly available as open source software at https://github.com/TurkuNLP/CAFA3.
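As an illustration of the ensemble idea described above, the following sketch (an assumed structure, not the authors' released code) averages the per-GO-term probabilities of a random forest and a neural network classifier; the feature matrix, labels and decision threshold are placeholders.

```python
# Minimal sketch: average RF and NN probabilities for one GO term.
# X stands in for precomputed protein features (e.g. BLAST / taxonomy /
# signature-analysis features); y marks whether the GO term applies.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(200, 50)                      # placeholder feature vectors
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # placeholder labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
nn = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0).fit(X_tr, y_tr)

# Ensemble step: average the positive-class probabilities of both models
# and assign the GO term when the averaged score exceeds a threshold.
score = (rf.predict_proba(X_te)[:, 1] + nn.predict_proba(X_te)[:, 1]) / 2
assigned = score >= 0.5
print(f"proteins assigned this GO term: {assigned.sum()} / {len(assigned)}")
```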


Subjects
Neural Networks, Computer; Proteins; Databases, Protein; Proteins/chemistry; Sequence Alignment; Software
2.
BMC Bioinformatics ; 21(Suppl 23): 580, 2020 Dec 29.
Article in English | MEDLINE | ID: mdl-33372589

ABSTRACT

BACKGROUND: Syntactic analysis, or parsing, is a key task in natural language processing and a required component for many text mining approaches. In recent years, Universal Dependencies (UD) has emerged as the leading formalism for dependency parsing. While a number of recent tasks centering on UD have substantially advanced the state of the art in multilingual parsing, there has been little study of parsing texts from specialized domains such as biomedicine. METHODS: We explore the application of state-of-the-art neural dependency parsing methods to biomedical text using the recently introduced CRAFT-SA shared task dataset. The CRAFT-SA task broadly follows the UD representation and recent UD task conventions, allowing us to fine-tune the UD-compatible Turku Neural Parser and UDify neural parsers to the task. We further evaluate the effect of transfer learning using a broad selection of BERT models, including several models pre-trained specifically for biomedical text processing. RESULTS: We find that recently introduced neural parsing technology is capable of generating highly accurate analyses of biomedical text, substantially improving on the best performance reported in the original CRAFT-SA shared task. We also find that initialization using a deep transfer learning model pre-trained on in-domain texts is key to maximizing the performance of the parsing methods.


Subjects
Biomedical Research; Data Mining; Software; Humans; Language; Models, Statistical; Natural Language Processing
3.
J Biomed Semantics ; 11(1): 10, 2020 09 01.
Article in English | MEDLINE | ID: mdl-32873340

ABSTRACT

BACKGROUND: Up to 35% of nurses' working time is spent on care documentation. We describe the evaluation of a system aimed at assisting nurses in documenting patient care and potentially reducing the documentation workload. Our goal is to enable nurses to write or dictate nursing notes in a narrative manner without having to manually structure their text under subject headings. In the current care classification standard used in the targeted hospital, there are more than 500 subject headings to choose from, making it challenging and time-consuming for nurses to use. METHODS: The task of the presented system is to automatically group sentences into paragraphs and assign subject headings. For classification the system relies on a neural network-based text classification model. The nursing notes are initially classified on the sentence level. Subsequently, coherent paragraphs are constructed from related sentences. RESULTS: Based on a manual evaluation conducted by a group of three domain experts, we find that in about 69% of the paragraphs formed by the system the topics of the sentences are coherent and the assigned paragraph headings correctly describe the topics. We also show that the use of a paragraph merging step reduces the number of paragraphs produced by 23% without affecting the performance of the system. CONCLUSIONS: The study shows that the presented system produces a coherent and logical structure for freely written nursing narratives and has the potential to reduce the time and effort nurses currently spend on documenting care in hospitals.
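As a sketch of the paragraph-construction step (our reading of the abstract, not the deployed system), the snippet below merges consecutive sentences that received the same predicted subject heading into a single paragraph; the sentences and headings are hypothetical.

```python
# Minimal sketch: group consecutive sentences with the same predicted heading.
from itertools import groupby

def build_paragraphs(sentences, predicted_headings):
    """Merge consecutive sentences sharing a predicted heading into paragraphs."""
    pairs = list(zip(predicted_headings, sentences))
    paragraphs = []
    for heading, group in groupby(pairs, key=lambda pair: pair[0]):
        paragraphs.append((heading, " ".join(sentence for _, sentence in group)))
    return paragraphs

sentences = [
    "Patient slept well during the night.",
    "No complaints of pain in the morning.",
    "Pain medication was not needed.",
]
headings = ["Sleep", "Pain", "Pain"]  # hypothetical classifier output
for heading, text in build_paragraphs(sentences, headings):
    print(f"{heading}: {text}")
```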


Subjects
Documentation; Nurses; Automation; Hospitals; Language; Subject Headings
4.
J Am Med Inform Assoc ; 27(1): 81-88, 2020 01 01.
Article in English | MEDLINE | ID: mdl-31605490

ABSTRACT

OBJECTIVE: This study focuses on the task of automatically assigning standardized (topical) subject headings to free-text sentences in clinical nursing notes. The underlying motivation is to support nurses when they document patient care by developing a computer system that can assist in incorporating suitable subject headings that reflect the documented topics. Central to this study is the performance evaluation of several text classification methods to assess the feasibility of developing such a system. MATERIALS AND METHODS: Seven text classification methods are evaluated using a corpus of approximately 0.5 million nursing notes (5.5 million sentences) with 676 unique headings extracted from a Finnish university hospital. Several of these methods are based on artificial neural networks. Evaluation is first done in an automatic manner for all methods, then a manual error analysis is performed on a sample. RESULTS: We find that a method based on a bidirectional long short-term memory network performs best, with an average recall of 0.5435 when allowed to suggest 1 subject heading per sentence and 0.8954 when allowed to suggest 10 subject headings per sentence. However, other methods achieve comparable results. The manual analysis indicates that the predictions are better than what the automatic evaluation suggests. CONCLUSIONS: The results indicate that several of the tested methods perform well in suggesting the most appropriate subject headings at the sentence level. Thus, we find it feasible to develop a text classification system that can support the use of standardized terminologies and save nurses time and effort on care documentation.
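The reported evaluation measure, recall when the classifier may suggest the k highest-scoring headings per sentence, can be sketched as follows; the score matrix and gold headings are made-up placeholders.

```python
# Minimal sketch of recall@k for multi-label heading suggestion.
import numpy as np

def recall_at_k(scores, gold, k):
    """scores: (n_sentences, n_headings) model scores;
    gold: list of sets of gold heading indices per sentence."""
    recalls = []
    for row, gold_set in zip(scores, gold):
        top_k = set(np.argsort(row)[::-1][:k])       # k best-scoring headings
        recalls.append(len(top_k & gold_set) / len(gold_set))
    return float(np.mean(recalls))

rng = np.random.RandomState(0)
scores = rng.rand(3, 676)           # 676 unique headings, as in the study
gold = [{5}, {12, 40}, {600}]       # hypothetical gold headings per sentence
print(recall_at_k(scores, gold, 1), recall_at_k(scores, gold, 10))
```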


Subjects
Abstracting and Indexing/methods; Natural Language Processing; Nursing Records; Standardized Nursing Terminology; Subject Headings; Electronic Health Records; Finland
5.
Database (Oxford) ; 2018, 2018 01 01.
Article in English | MEDLINE | ID: mdl-30576487

ABSTRACT

Biomedical researchers regularly discover new interactions between chemical compounds/drugs and genes/proteins, and report them in the research literature. Having knowledge about these interactions is crucially important in many research areas such as precision medicine and drug discovery. The BioCreative VI Task 5 (CHEMPROT) challenge promotes the development and evaluation of computer systems that can automatically recognize and extract statements of such interactions from biomedical literature. We participated in this challenge with a Support Vector Machine (SVM) system and a deep learning-based system (ST-ANN), and achieved an F-score of 60.99 for the task. After the shared task, we significantly improved the performance of the ST-ANN system. Additionally, we have developed a new deep learning-based system (I-ANN) that considerably outperforms the ST-ANN system. Both the ST-ANN and I-ANN systems are centered around training an ensemble of artificial neural networks and utilizing different bidirectional Long Short-Term Memory (LSTM) chains for representing the shortest dependency path and/or the full sentence. By combining the predictions of the SVM and I-ANN systems, we achieved an F-score of 63.10 for the task, improving our previous F-score by 2.11 percentage points. Our systems are fully open source and publicly available. We highlight that the systems presented in this study are not limited to BioCreative VI Task 5: they can be re-trained to extract any type of relation of interest, with no modifications of the source code required, provided a manually annotated corpus is supplied as training data in a specific file format.
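The abstract does not specify how the SVM and I-ANN predictions are combined, so the sketch below simply averages the two systems' class probabilities for each candidate chemical-protein pair; the class inventory, probabilities and weight are illustrative assumptions.

```python
# Minimal sketch: weighted averaging of two relation-extraction systems' outputs.
import numpy as np

CLASSES = ["NONE", "CPR:3", "CPR:4", "CPR:5", "CPR:6", "CPR:9"]  # CHEMPROT groups

def combine(p_svm, p_ann, weight=0.5):
    """Average the class-probability matrices of two systems, then take argmax."""
    merged = weight * p_svm + (1.0 - weight) * p_ann
    return [CLASSES[i] for i in merged.argmax(axis=1)]

rng = np.random.RandomState(1)
p_svm = rng.dirichlet(np.ones(len(CLASSES)), size=4)   # placeholder system outputs
p_ann = rng.dirichlet(np.ones(len(CLASSES)), size=4)
print(combine(p_svm, p_ann))
```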


Subjects
Drug Discovery/methods; Neural Networks, Computer; Pharmaceutical Preparations; Proteins; Support Vector Machine; Data Mining; Databases, Chemical; Databases, Protein; Deep Learning; Pharmaceutical Preparations/chemistry; Pharmaceutical Preparations/metabolism; Protein Binding; Proteins/chemistry; Proteins/metabolism
6.
J Am Med Inform Assoc ; 25(10): 1274-1283, 2018 10 01.
Article in English | MEDLINE | ID: mdl-30272184

ABSTRACT

Objective: We executed the Social Media Mining for Health (SMM4H) 2017 shared tasks to enable the community-driven development and large-scale evaluation of automatic text processing methods for the classification and normalization of health-related text from social media. An additional objective was to publicly release manually annotated data. Materials and Methods: We organized 3 independent subtasks: automatic classification of self-reports of (1) adverse drug reactions (ADRs) and (2) medication consumption, from medication-mentioning tweets, and (3) normalization of ADR expressions. Training data consisted of 15 717 annotated tweets for (1), 10 260 for (2), and 6650 ADR phrases and identifiers for (3); and exhibited typical properties of social-media-based health-related texts. Systems were evaluated using 9961, 7513, and 2500 instances for the 3 subtasks, respectively. We evaluated performances of classes of methods and ensembles of system combinations following the shared tasks. Results: Among 55 system runs, the best system scores for the 3 subtasks were 0.435 (ADR class F1-score) for subtask-1, 0.693 (micro-averaged F1-score over two classes) for subtask-2, and 88.5% (accuracy) for subtask-3. Ensembles of system combinations obtained best scores of 0.476, 0.702, and 88.7%, outperforming individual systems. Discussion: Among individual systems, support vector machines and convolutional neural networks showed high performance. Performance gains achieved by ensembles of system combinations suggest that such strategies may be suitable for operational systems relying on difficult text classification tasks (eg, subtask-1). Conclusions: Data imbalance and lack of context remain challenges for natural language processing of social media text. Annotated data from the shared task have been made available as reference standards for future studies (http://dx.doi.org/10.17632/rxwfb3tysd.1).


Subjects
Drug-Related Side Effects and Adverse Reactions/classification; Natural Language Processing; Neural Networks, Computer; Social Media/classification; Support Vector Machine; Data Mining/methods; Humans; Pharmacovigilance
7.
Database (Oxford) ; 2018: 1-10, 2018 01 01.
Article in English | MEDLINE | ID: mdl-30239666

ABSTRACT

We present a system for automatically identifying a multitude of biomedical entities from the literature. This work is based on our previous efforts in the BioCreative VI: Interactive Bio-ID Assignment shared task, in which our system demonstrated state-of-the-art performance with the highest achieved results in named entity recognition. In this paper we describe the original conditional random field-based system used in the shared task as well as experiments conducted since, including better hyperparameter tuning and character-level modeling, which led to further performance improvements. For normalizing the mentions into unique identifiers we use fuzzy character n-gram matching. The normalization approach has also been improved with a better abbreviation resolution method and stricter guideline compliance, resulting in vastly improved results for various entity types. All tools and models used for both named entity recognition and normalization are publicly available under an open license. Database URL: https://github.com/TurkuNLP/BioCreativeVI_BioID_assignment.
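The fuzzy character n-gram matching used for normalization could look roughly like the sketch below; the dictionary entries and the Dice-style overlap score are illustrative assumptions, not the system's actual lexicon or scoring.

```python
# Minimal sketch: map a recognized mention to the dictionary entry with the
# highest character n-gram overlap (Dice coefficient).
def char_ngrams(text, n=3):
    text = f" {text.lower()} "
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def normalize(mention, dictionary, n=3):
    query = char_ngrams(mention, n)
    best_id, best_score = None, 0.0
    for identifier, name in dictionary.items():
        candidate = char_ngrams(name, n)
        score = 2 * len(query & candidate) / (len(query) + len(candidate))
        if score > best_score:
            best_id, best_score = identifier, score
    return best_id, best_score

dictionary = {"UniProt:P04637": "cellular tumor antigen p53",   # hypothetical entries
              "NCBIGene:7157": "TP53"}
print(normalize("p53 protein", dictionary))
```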


Subjects
Algorithms; Fuzzy Logic; Molecular Sequence Annotation
8.
PeerJ ; 6: e4806, 2018.
Article in English | MEDLINE | ID: mdl-29844966

ABSTRACT

The increasing move towards open access full-text scientific literature enhances our ability to utilize advanced text-mining methods to construct information-rich networks that no human will be able to grasp simply from 'reading the literature'. The utility of text mining for well-studied species is obvious, though the utility for less studied species, or those with no prior track record at all, is not clear. Here we present a concept for how advanced text mining can be used to create information-rich networks even for less well studied species, and apply it to generate an open-access gene-gene association network resource for Synechocystis sp. PCC 6803, a representative model organism for cyanobacteria and the first case study for the methodology. By merging the text-mining network with networks generated from species-specific experimental data, network integration was used to enhance the accuracy of predicting novel interactions that are biologically relevant. A rule-based algorithm (filter) was constructed in order to automate the search for novel candidate genes with a high degree of likely association to known target genes by (1) ignoring established relationships from the existing literature, as they are already 'known', and (2) demanding multiple independent pieces of evidence for every novel and potentially relevant relationship. Using selected case studies, we demonstrate the utility of the network resource and filter to (i) discover novel candidate associations between different genes or proteins in the network, and (ii) rapidly evaluate the potential role of any one particular gene or protein. The full network is provided as an open-source resource.
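The two filtering rules can be paraphrased in code as below; the gene identifiers, evidence networks and support threshold are illustrative and do not come from the published resource.

```python
# Minimal sketch: keep a candidate partner gene only if the association is
# (1) absent from the literature-derived network and (2) supported by at
# least two independent evidence networks.
def filter_candidates(target, literature_edges, evidence_networks, min_support=2):
    candidates = {}
    for source_name, edges in evidence_networks.items():
        for a, b in edges:
            if target not in (a, b):
                continue
            partner = b if a == target else a
            pair = frozenset((target, partner))
            if pair in literature_edges:           # rule 1: already 'known'
                continue
            candidates.setdefault(partner, set()).add(source_name)
    # rule 2: demand multiple independent evidences
    return {gene: sources for gene, sources in candidates.items()
            if len(sources) >= min_support}

literature_edges = {frozenset(("geneA", "geneB"))}             # hypothetical network
evidence_networks = {
    "coexpression": [("geneA", "geneC"), ("geneA", "geneB")],
    "phylogenetic_profile": [("geneA", "geneC")],
}
print(filter_candidates("geneA", literature_edges, evidence_networks))
```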

9.
Stud Health Technol Inform ; 247: 725-729, 2018.
Article in English | MEDLINE | ID: mdl-29678056

ABSTRACT

We report on the development and evaluation of a prototype tool aimed at assisting laypeople/patients in understanding the content of clinical narratives. The tool relies largely on unsupervised machine learning applied to two large corpora of unlabeled text: a clinical corpus and a general-domain corpus. A joint semantic word-space model is created for the purpose of extracting easier-to-understand alternatives for words considered difficult for laypeople to understand. Two domain experts evaluate the tool and inter-rater agreement is calculated. When the tool is asked to suggest ten alternatives for each difficult word, it suggests acceptable lay words for 55.51% of them. This and future manual evaluations will serve to further improve performance, where supervised machine learning will also be used.
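A minimal sketch of the suggestion step is shown below: nearest neighbours of a difficult term are retrieved in a shared embedding space, restricted to a general-domain (lay) vocabulary. The toy vectors stand in for the actual joint word-space model.

```python
# Minimal sketch: suggest lay alternatives by cosine similarity in a shared space.
import numpy as np

embeddings = {                       # hypothetical 3-d vectors for illustration
    "haematoma": np.array([0.9, 0.1, 0.0]),
    "bruise":    np.array([0.8, 0.2, 0.1]),
    "swelling":  np.array([0.5, 0.5, 0.2]),
    "fracture":  np.array([0.0, 0.9, 0.3]),
}
lay_vocabulary = {"bruise", "swelling", "fracture"}  # general-domain candidates

def suggest(difficult_word, k=10):
    query = embeddings[difficult_word]
    scored = []
    for word in lay_vocabulary:
        vector = embeddings[word]
        cosine = float(query @ vector / (np.linalg.norm(query) * np.linalg.norm(vector)))
        scored.append((cosine, word))
    return [word for _, word in sorted(scored, reverse=True)[:k]]

print(suggest("haematoma", k=2))
```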


Subjects
Comprehension; Narration; Natural Language Processing; Semantics; Humans; Supervised Machine Learning; Unsupervised Machine Learning
10.
Genome Biol ; 17(1): 184, 2016 09 07.
Article in English | MEDLINE | ID: mdl-27604469

ABSTRACT

BACKGROUND: A major bottleneck in our understanding of the molecular underpinnings of life is the assignment of function to proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, assessing methods for protein function prediction and tracking progress in the field remain challenging. RESULTS: We conducted the second critical assessment of functional annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. We evaluated 126 methods from 56 research groups for their ability to predict biological functions using the Gene Ontology and gene-disease associations using the Human Phenotype Ontology on a set of 3681 proteins from 18 species. CAFA2 featured expanded analysis compared with CAFA1 with regard to data set size, variety, and assessment metrics. To review progress in the field, the analysis compared the best methods from CAFA1 to those of CAFA2. CONCLUSIONS: The top-performing methods in CAFA2 outperformed those from CAFA1. This increased accuracy can be attributed to a combination of the growing number of experimental annotations and improved methods for function prediction. The assessment also revealed that the definition of top-performing algorithms is ontology specific, that different performance metrics can be used to probe the nature of accurate predictions, and the relative diversity of predictions in the biological process and human phenotype ontologies. While there was methodological improvement between CAFA1 and CAFA2, the interpretation of results and usefulness of individual methods remain context-dependent.


Subjects
Computational Biology; Proteins/chemistry; Software; Structure-Activity Relationship; Algorithms; Databases, Protein; Gene Ontology; Humans; Molecular Sequence Annotation; Proteins/genetics
11.
J Biomed Semantics ; 7: 27, 2016.
Article in English | MEDLINE | ID: mdl-27175227

ABSTRACT

BACKGROUND: Biomedical event extraction is one of the key tasks in biomedical text mining, supporting various applications such as database curation and hypothesis generation. Several systems, some of which have been applied at a large scale, have been introduced to solve this task. Past studies have shown that the identification of the phrases describing biological processes, also known as trigger detection, is a crucial part of event extraction, and notable overall performance gains can be obtained by solely focusing on this sub-task. In this paper we propose a novel approach for filtering falsely identified triggers from large-scale event databases, thus improving the quality of knowledge extraction. METHODS: Our method relies on state-of-the-art word embeddings, event statistics gathered from the whole biomedical literature, and both supervised and unsupervised machine learning techniques. We focus on EVEX, an event database covering the whole PubMed and PubMed Central Open Access literature and containing more than 40 million extracted events. The most frequent EVEX trigger words are hierarchically clustered, and the resulting cluster tree is pruned to identify words that can never act as triggers regardless of their context. For rarely occurring trigger words we introduce a supervised approach trained on the combination of trigger word classifications produced by the unsupervised clustering method and manual annotation. RESULTS: The method is evaluated on the official test set of the BioNLP Shared Task on Event Extraction. The evaluation shows that the method can be used to improve the performance of state-of-the-art event extraction systems. This successful effort also translates into removing 1,338,075 potentially incorrect events from EVEX, thus greatly improving the quality of the data. The method is not solely bound to the EVEX resource and can thus be used to improve the quality of any event extraction system or database. AVAILABILITY: The data and source code for this work are available at: http://bionlp-www.utu.fi/trigger-clustering/.
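A toy sketch of the clustering step follows: trigger-word embeddings are hierarchically clustered and the tree is cut into clusters, after which clusters judged to contain only non-triggers could be discarded. The vectors, words and distance threshold are placeholders, not EVEX data.

```python
# Minimal sketch: hierarchical clustering of (stand-in) trigger-word embeddings.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

words = ["expression", "phosphorylation", "binding", "hospital", "figure"]
vectors = np.random.RandomState(0).randn(len(words), 20)   # stand-in embeddings

Z = linkage(vectors, method="average", metric="cosine")     # build the cluster tree
labels = fcluster(Z, t=0.6, criterion="distance")           # cut the tree

clusters = {}
for word, label in zip(words, labels):
    clusters.setdefault(label, []).append(word)
print(clusters)   # clusters containing only non-trigger words would be pruned
```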


Subjects
Medical Informatics/methods; Natural Language Processing; Supervised Machine Learning; Unsupervised Machine Learning; Data Mining; Databases, Factual
12.
Bioinformatics ; 32(2): 276-82, 2016 Jan 15.
Article in English | MEDLINE | ID: mdl-26428294

ABSTRACT

MOTIVATION: The recognition and normalization of cell line names in text is an important task in biomedical text mining research, facilitating, for instance, the identification of synthetically lethal genes from the literature. While several tools have previously been developed to address cell line recognition, it is unclear whether available systems can perform sufficiently well in realistic and broad-coverage applications such as extracting synthetically lethal genes from the cancer literature. In this study, we revisit the cell line name recognition task, evaluating both available systems and newly introduced methods on various resources to obtain a reliable tagger not tied to any specific subdomain. In support of this task, we introduce two text collections manually annotated for cell line names: the broad-coverage corpus Gellus and CLL, a focused target domain corpus. RESULTS: We find that the best performance is achieved using NERsuite, a machine learning system based on Conditional Random Fields, trained on the Gellus corpus and supported with a dictionary of cell line names. The system achieves an F-score of 88.46% on the test set of Gellus and 85.98% on the independently annotated CLL corpus. It was further applied at large scale to 24 302 102 unannotated articles, resulting in the identification of 5 181 342 cell line mentions, normalized to 11 755 unique cell line database identifiers. AVAILABILITY AND IMPLEMENTATION: The manually annotated datasets, the cell line dictionary, derived corpora, NERsuite models and the results of the large-scale run on unannotated texts are available under open licenses at http://turkunlp.github.io/Cell-line-recognition/. CONTACT: sukaew@utu.fi.


Subjects
Data Mining/methods; Databases, Factual; Genes, Lethal; Neoplasms/pathology; Terminology as Topic; Artificial Intelligence; Cell Line; Computational Biology/methods; Humans; Information Storage and Retrieval; Machine Learning; Neoplasms/genetics; Publications; Semantics
13.
BMC Bioinformatics ; 16 Suppl 16: S3, 2015.
Article in English | MEDLINE | ID: mdl-26551766

ABSTRACT

BACKGROUND: Modern methods for mining biomolecular interactions from literature typically make predictions based solely on the immediate textual context, in effect a single sentence. No prior work has been published on extending this context to information automatically gathered from the whole biomedical literature. Thus, our motivation for this study is to explore whether mutually supporting evidence, aggregated across several documents, can be utilized to improve the performance of state-of-the-art event extraction systems. RESULTS: In the GENIA Event Extraction (GE) task, our re-ranking approach led to a modest performance increase and resulted in the first rank of the official Shared Task results with a 50.97% F-score. Additionally, in this paper we explore and evaluate the usage of distributed vector representations for this challenge. CONCLUSIONS: For the Gene Regulatory Network (GRN) task, we were able to produce a gene regulatory network from the EVEX data, warranting the use of such generic large-scale text mining data in network biology settings. A detailed performance and error analysis provides more insight into the relatively low recall rates.


Subjects
Data Mining; Gene Regulatory Networks; Molecular Sequence Annotation; Natural Language Processing
14.
BMC Med Inform Decis Mak ; 15 Suppl 2: S2, 2015.
Article in English | MEDLINE | ID: mdl-26099735

ABSTRACT

Patients' health-related information is stored in electronic health records (EHRs) by health service providers. These records include sequential documentation of care episodes in the form of clinical notes. EHRs are used throughout the health care sector by professionals, administrators and patients, primarily for clinical purposes, but also for secondary purposes such as decision support and research. The vast amounts of information in EHR systems complicate information management and increase the risk of information overload. Therefore, clinicians and researchers need new tools to manage the information stored in the EHRs. A common use case is, given a (possibly unfinished) care episode, to retrieve the most similar care episodes among the records. This paper presents several methods for information retrieval, focusing on care episode retrieval, based on textual similarity, where similarity is measured through domain-specific modelling of the distributional semantics of words. Models include variants of random indexing and the semantic neural network model word2vec. Two novel methods are introduced that utilize the ICD-10 codes attached to care episodes to better induce domain-specificity in the semantic model. We report on an experimental evaluation of care episode retrieval that circumvents the lack of human judgements regarding episode relevance. Results suggest that several of the proposed methods outperform a state-of-the-art search engine (Lucene) on the retrieval task.
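One of the simpler similarity models can be sketched as follows (toy notes, gensim 4.x API): each care episode is represented by the average of its word2vec vectors, and stored episodes are ranked by cosine similarity to the query episode. The actual models were trained on large Finnish EHR corpora and include ICD-10-informed variants not shown here.

```python
# Minimal sketch: care episode retrieval with averaged word2vec vectors.
import numpy as np
from gensim.models import Word2Vec

episodes = {                                   # hypothetical, tokenized episodes
    "ep1": "patient complains of chest pain ecg taken".split(),
    "ep2": "wound cleaned and dressing changed".split(),
    "ep3": "chest pain relieved after nitroglycerin".split(),
}
model = Word2Vec(sentences=list(episodes.values()),
                 vector_size=50, window=5, min_count=1, seed=0)

def episode_vector(tokens):
    return np.mean([model.wv[t] for t in tokens if t in model.wv], axis=0)

def rank(query_id):
    query = episode_vector(episodes[query_id])
    similarities = {}
    for episode_id, tokens in episodes.items():
        if episode_id == query_id:
            continue
        vector = episode_vector(tokens)
        similarities[episode_id] = float(
            query @ vector / (np.linalg.norm(query) * np.linalg.norm(vector)))
    return sorted(similarities.items(), key=lambda item: item[1], reverse=True)

print(rank("ep1"))   # most similar stored care episodes first
```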


Subjects
Clinical Coding/standards; Decision Support Systems, Clinical/organization & administration; Electronic Health Records/organization & administration; Episode of Care; Health Information Management/organization & administration; Information Storage and Retrieval/methods; Algorithms; Clinical Coding/methods; Health Information Management/methods; Humans; International Classification of Diseases; Models, Theoretical; Semantics
15.
Artif Intell Med ; 61(3): 131-6, 2014 Jul.
Article in English | MEDLINE | ID: mdl-24680097

ABSTRACT

OBJECTIVES: In this paper, we study the development and domain adaptation of statistical syntactic parsers for three different clinical domains in Finnish. METHODS AND MATERIALS: The materials include text from daily nursing notes written by nurses in an intensive care unit, physicians' notes from cardiology patients' health records, and daily nursing notes from cardiology patients' health records. The parsing is performed with the statistical parser of Bohnet (http://code.google.com/p/mate-tools/, accessed: 22 November 2013). RESULTS: A parser trained only on general language performs poorly in all clinical subdomains, with the labelled attachment score (LAS) ranging from 59.4% to 71.4%, whereas domain data combined with general language gives better results, the LAS varying between 67.2% and 81.7%. However, even a small amount of clinical domain data quickly outperforms this, and clinical data from other domains is also more beneficial (LAS 71.3-80.0%) than general language only. The best results (LAS 77.4-84.6%) are achieved by using the combination of all the clinical treebanks as training data. CONCLUSIONS: In order to develop a good syntactic parser for clinical language variants, a general language resource is not mandatory, but data from clinical fields is. Moreover, in addition to data from the exact same clinical domain, data from other clinical domains is also useful.
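The labelled attachment score used throughout the abstract is the percentage of tokens whose predicted head and dependency label both match the gold annotation; a small sketch with made-up token data:

```python
# Minimal sketch: labelled attachment score (LAS).
def las(gold, predicted):
    """gold, predicted: lists of (head_index, dependency_label) per token."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return 100.0 * correct / len(gold)

gold      = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "nmod")]
predicted = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "nmod")]   # one head error
print(f"LAS: {las(gold, predicted):.1f}%")   # 75.0%
```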


Subjects
Language; Terminology as Topic; Data Mining; Finland; Humans; Intensive Care Units; Natural Language Processing; Nurses; Nursing Homes; Physicians
16.
PLoS One ; 8(4): e55814, 2013.
Article in English | MEDLINE | ID: mdl-23613707

ABSTRACT

Text mining for the life sciences aims to aid database curation, knowledge summarization and information retrieval through the automated processing of biomedical texts. To provide comprehensive coverage and enable full integration with existing biomolecular database records, it is crucial that text mining tools scale up to millions of articles and that their analyses can be unambiguously linked to information recorded in resources such as UniProt, KEGG, BioGRID and NCBI databases. In this study, we investigate how fully automated text mining of complex biomolecular events can be augmented with a normalization strategy that identifies biological concepts in text, mapping them to identifiers at varying levels of granularity, ranging from canonicalized symbols to unique genes and proteins and broad gene families. To this end, we have combined two state-of-the-art text mining components, previously evaluated on two community-wide challenges, and have extended and improved upon these methods by exploiting their complementary nature. Using these systems, we perform normalization and event extraction to create a large-scale resource that is publicly available, unique in semantic scope, and covers all 21.9 million PubMed abstracts and 460 thousand PubMed Central open access full-text articles. This dataset contains 40 million biomolecular events involving 76 million gene/protein mentions, linked to 122 thousand distinct genes from 5032 species across the full taxonomic tree. Detailed evaluations and analyses reveal promising results for application of this data in database and pathway curation efforts. The main software components used in this study are released under an open-source license. Further, the resulting dataset is freely accessible through a novel API, providing programmatic and customized access (http://www.evexdb.org/api/v001/). Finally, to allow for large-scale bioinformatic analyses, the entire resource is available for bulk download from http://evexdb.org/download/, under the Creative Commons Attribution-Share Alike (CC BY-SA) license.


Subjects
Data Mining; Genes; Publications; Algorithms; Multigene Family; Reference Standards; Signal Transduction/genetics; Statistics as Topic
17.
BMC Bioinformatics ; 13 Suppl 11: S4, 2012 Jun 26.
Article in English | MEDLINE | ID: mdl-22759458

ABSTRACT

BACKGROUND: We present a system for extracting biomedical events (detailed descriptions of biomolecular interactions) from research articles, developed for the BioNLP'11 Shared Task. Our goal is to develop a system easily adaptable to different event schemes, following the theme of the BioNLP'11 Shared Task: generalization, the extension of event extraction to varied biomedical domains. Our system extends our BioNLP'09 Shared Task winning Turku Event Extraction System, which uses support vector machines to first detect event-defining words, followed by detection of their relationships. RESULTS: Our current system successfully predicts events for every domain case introduced in the BioNLP'11 Shared Task, being the only system to participate in all eight tasks and all of their subtasks, with best performance in four tasks. Following the Shared Task, we improve the system on the Infectious Diseases task from 42.57% to 53.87% F-score, bringing performance into line with the similar GENIA Event Extraction and Epigenetics and Post-translational Modifications tasks. We evaluate the machine learning performance of the system by calculating learning curves for all tasks, detecting areas where additional annotated data could be used to improve performance. Finally, we evaluate the use of system output on external articles as additional training data in a form of self-training. CONCLUSIONS: We show that the updated Turku Event Extraction System can easily be adapted to all presently available event extraction targets, with competitive performance in most tasks. The scope of the performance gains between the 2009 and 2011 BioNLP Shared Tasks indicates event extraction is still a new field requiring more work. We provide several analyses of event extraction methods and performance, highlighting potential future directions for continued development.
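The learning-curve analysis mentioned above can be sketched on synthetic data as follows; a plain SVM classifier stands in for the full event extraction pipeline, and the data and scores are not from the shared task.

```python
# Minimal sketch: held-out F-score as a function of training set size.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, n_features=40, random_state=0)
sizes, train_scores, test_scores = learning_curve(
    LinearSVC(), X, y, train_sizes=np.linspace(0.1, 1.0, 5),
    scoring="f1", cv=5)

for n, score in zip(sizes, test_scores.mean(axis=1)):
    print(f"{int(n):4d} training examples -> mean F1 {score:.3f}")
```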


Subjects
Artificial Intelligence; Data Mining; Natural Language Processing; Bacteria/classification; Bacteria/genetics; Communicable Diseases/metabolism; Ecosystem; Epigenomics; Epistasis, Genetic; Genes, Bacterial; Humans; Protein Processing, Post-Translational; Proteins/genetics; Support Vector Machine; Terminology as Topic
18.
Adv Bioinformatics ; 2012: 582765, 2012.
Article in English | MEDLINE | ID: mdl-22719757

ABSTRACT

Technological advancements in the field of genetics have not only led to an abundance of experimental data, but have also caused an exponential increase in the number of published biomolecular studies. Text mining is widely accepted as a promising technique to help researchers in the life sciences deal with the amount of available literature. This paper presents a freely available web application built on top of 21.3 million detailed biomolecular events extracted from all PubMed abstracts. These text mining results were generated by a state-of-the-art event extraction system and enriched with gene family associations and abstract generalizations, accounting for lexical variants and synonymy. The EVEX resource locates relevant literature on phosphorylation, regulation targets, binding partners, and several other biomolecular events and assigns confidence values to these events. The search function accepts official gene/protein symbols as well as common names from all species. Finally, the web application is a powerful tool for generating homology-based hypotheses as well as novel, indirect associations between genes and proteins such as coregulators.

19.
BMC Bioinformatics ; 12: 481, 2011 Dec 18.
Article in English | MEDLINE | ID: mdl-22177292

ABSTRACT

BACKGROUND: Bio-molecular event extraction from literature is recognized as an important task of bio text mining and, as such, many relevant systems have been developed and made available during the last decade. While such systems provide useful services individually, there is a need for a meta-service to enable comparison and ensemble of such services, offering optimal solutions for various purposes. RESULTS: We have integrated nine event extraction systems in the U-Compare framework, making them intercompatible and interoperable with other U-Compare components. The U-Compare event meta-service provides various meta-level features for comparison and ensemble of multiple event extraction systems. Experimental results show that the performance improvements achieved by the ensemble are significant. CONCLUSIONS: While individual event extraction systems themselves provide useful features for bio text mining, the U-Compare meta-service is expected to improve the accessibility to the individual systems, and to enable meta-level uses over multiple event extraction systems such as comparison and ensemble.


Subjects
Data Mining; Computer Systems; Periodicals as Topic; Software
20.
Bioinformatics ; 26(12): i382-90, 2010 Jun 15.
Article in English | MEDLINE | ID: mdl-20529932

ABSTRACT

MOTIVATION: There has recently been a notable shift in biomedical information extraction (IE) from relation models toward the more expressive event model, facilitated by the maturation of basic tools for biomedical text analysis and the availability of manually annotated resources. The event model allows detailed representation of complex natural language statements and can support a number of advanced text mining applications ranging from semantic search to pathway extraction. A recent collaborative evaluation demonstrated the potential of event extraction systems, yet there have so far been no studies of the generalization ability of the systems nor the feasibility of large-scale extraction. RESULTS: This study considers event-based IE at PubMed scale. We introduce a system combining publicly available, state-of-the-art methods for domain parsing, named entity recognition and event extraction, and test the system on a representative 1% sample of all PubMed citations. We present the first evaluation of the generalization performance of event extraction systems to this scale and show that despite its computational complexity, event extraction from the entire PubMed is feasible. We further illustrate the value of the extraction approach through a number of analyses of the extracted information. AVAILABILITY: The event detection system and extracted data are open source licensed and available at http://bionlp.utu.fi/.


Subjects
Data Mining/methods; PubMed; Natural Language Processing; Systems Biology