Search | BVS Regional Portal (test)

Developing a disease outbreak event corpus.

Conway, Mike; Kawazoe, Ai; Chanlekha, Hutchatai; Collier, Nigel.

J Med Internet Res ; 12(3): e43, 2010 Sep 28.

Article in English | MEDLINE | ID: mdl-20876049

ABSTRACT

BACKGROUND: In recent years, there has been a growth in work on the use of information extraction technologies for tracking disease outbreaks from online news texts, yet publicly available evaluation standards (and associated resources) for this new area of research have been noticeably lacking. OBJECTIVE: This study seeks to create a "gold standard" data set against which to test how accurately disease outbreak information extraction systems can identify the semantics of disease outbreak events. Additionally, we hope that the provision of an annotation scheme (and associated corpus) to the community will encourage open evaluation in this new and growing application area. METHODS: We developed an annotation scheme for identifying infectious disease outbreak events in news texts. An event--in the context of our annotation scheme--consists minimally of geographical (eg, country and province) and disease name information. However, the scheme also allows for the rich encoding of other domain salient concepts (eg, international travel, species, and food contamination). RESULTS: The work resulted in a 200-document corpus of event-annotated disease outbreak reports that can be used to evaluate the accuracy of event detection algorithms (in this case, for the BioCaster biosurveillance online news information extraction system). In the 200 documents, 394 distinct events were identified (mean 1.97 events per document, range 0-25 events per document). We also provide a download script and graphical user interface (GUI)-based event browsing software to facilitate corpus exploration. CONCLUSION: In summary, we present an annotation scheme and corpus that can be used in the evaluation of disease outbreak event extraction algorithms. The annotation scheme and corpus were designed both with the particular evaluation requirements of the BioCaster system in mind as well as the wider need for further evaluation resources in this growing research area.
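The minimal event definition in the abstract above (geographical information plus a disease name, with optional richer attributes) can be illustrated with a small Python sketch. The class name `OutbreakEvent`, its field names, and the example values are hypothetical and are not taken from the published annotation scheme; only the corpus figures (200 documents, 394 events) come from the abstract.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class OutbreakEvent:
    """Hypothetical record for one annotated event (not the authors' schema)."""
    disease: str                      # required: disease name
    country: str                      # required: geographical information
    province: Optional[str] = None    # optional finer-grained location
    # Optional domain concepts, e.g. international travel or species.
    attributes: Dict[str, str] = field(default_factory=dict)


def mean_events_per_document(events_per_doc: List[int]) -> float:
    """Average number of events over a list of per-document event counts."""
    if not events_per_doc:
        raise ValueError("at least one document is required")
    return sum(events_per_doc) / len(events_per_doc)


# Corpus-level figure reported in the abstract: 394 events in 200 documents.
corpus_mean = 394 / 200

# An illustrative event with made-up values.
example = OutbreakEvent(disease="cholera", country="Exampleland",
                        attributes={"species": "human"})
```

Documents with no events contribute a count of zero, which is consistent with the reported range of 0-25 events per document.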

Subjects

Disease Outbreaks/statistics & numerical data , Online Systems , Animals , Disease Outbreaks/prevention & control , Documentation , Electronic Data Processing/methods , Geography , Humans , World Health Organization

A framework for enhancing spatial and temporal granularity in report-based health surveillance systems.

Chanlekha, Hutchatai; Kawazoe, Ai; Collier, Nigel.

BMC Med Inform Decis Mak ; 10: 1, 2010 Jan 12.

Article in English | MEDLINE | ID: mdl-20067612

ABSTRACT

BACKGROUND: Current public concern over the spread of infectious diseases has underscored the importance of health surveillance systems for the speedy detection of disease outbreaks. Several international report-based monitoring systems have been developed, including GPHIN, Argus, HealthMap, and BioCaster. A vital feature of these report-based systems is the geo-temporal encoding of outbreak-related textual data. Until now, automated systems have tended to use an ad-hoc strategy for processing geo-temporal information, normally involving the detection of locations that match pre-determined criteria, and the use of document publication dates as a proxy for disease event dates. Although these strategies appear to be effective enough for reporting events at the country and province levels, they may be less effective at discovering geo-temporal information at more detailed levels of granularity. In order to improve the capabilities of current Web-based health surveillance systems, we introduce the design for a novel scheme called spatiotemporal zoning. METHOD: The proposed scheme classifies news articles into zones according to the spatiotemporal characteristics of their content. In order to study the reliability of the annotation scheme, we analyzed the inter-annotator agreement among a group of human annotators for over 1000 reported events. The results are evaluated qualitatively and quantitatively, including the kappa and percentage agreement. RESULTS: The reliability evaluation of our scheme yielded very promising inter-annotator agreement, more than a 0.9 kappa and a 0.9 percentage agreement for event type annotation and temporal attributes annotation, respectively, with a slight degradation for the spatial attribute. However, for events indicating an outbreak situation, the annotators usually had inter-annotator agreements with the lowest granularity location. CONCLUSIONS: We developed and evaluated a novel spatiotemporal zoning annotation scheme. The results of the scheme evaluation indicate that our annotated corpus and the proposed annotation scheme are reliable and could be effectively used for developing an automatic system. Given the current advances in natural language processing techniques, including the availability of language resources and tools, we believe that a reliable automatic spatiotemporal zoning system can be achieved. In the next stage of this work, we plan to develop an automatic zoning system and evaluate its usability within an operational health surveillance system.
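The two agreement measures named in the abstract above, percentage agreement and kappa, can be sketched for the two-annotator case as follows. This is an illustration only: the abstract does not say which kappa variant was used, so the choice of Cohen's kappa is an assumption, and the labels in the example are made up.

```python
from collections import Counter
from typing import Hashable, Sequence


def percentage_agreement(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Fraction of items to which both annotators gave the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("annotations must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators (assumed variant, see lead-in)."""
    n = len(a)
    p_observed = percentage_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability that both pick the same label at random,
    # given each annotator's own label distribution.
    labels = counts_a.keys() | counts_b.keys()
    p_expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Made-up labels for four reported events.
annotator_1 = ["outbreak", "outbreak", "other", "other"]
annotator_2 = ["outbreak", "other", "other", "other"]
agreement = percentage_agreement(annotator_1, annotator_2)
kappa = cohen_kappa(annotator_1, annotator_2)
```

Kappa discounts the agreement expected by chance, which is why it can be noticeably lower than raw percentage agreement on the same annotations.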

Subjects

Disease Outbreaks/classification , Geographic Information Systems , Natural Language Processing , Population Surveillance/methods , Demography , Humans , Mass Media , Public Health Informatics , Reproducibility of Results , Research Design

Classifying disease outbreak reports using n-grams and semantic features.

Conway, Mike; Doan, Son; Kawazoe, Ai; Collier, Nigel.

Int J Med Inform ; 78(12): e47-58, 2009 Dec.

Article in English | MEDLINE | ID: mdl-19447070

ABSTRACT

INTRODUCTION: This paper explores the benefits of using n-grams and semantic features for the classification of disease outbreak reports, in the context of the BioCaster disease outbreak report text mining system. A novel feature of this work is the use of a general purpose semantic tagger - the USAS tagger - to generate features. BACKGROUND: We outline the application context for this work (the BioCaster epidemiological text mining system), before going on to describe the experimental data used in our classification experiments (the 1000 document BioCaster corpus). FEATURE SETS: Three broad groups of features are used in this work: Named Entity based features, n-gram features, and features derived from the USAS semantic tagger. METHODOLOGY: Three standard machine learning algorithms - Naïve Bayes, the Support Vector Machine algorithm, and the C4.5 decision tree algorithm - were used for classifying experimental data (that is, the BioCaster corpus). Feature selection was performed using the chi-squared feature selection algorithm. Standard text classification performance metrics - Accuracy, Precision, Recall, Specificity and F-score - are reported. RESULTS: A feature representation composed of unigrams, bigrams, trigrams and features derived from a semantic tagger, in conjunction with the Naïve Bayes algorithm and feature selection yielded the highest classification accuracy (and F-score). This result was statistically significant compared to a baseline unigram representation and to previous work on the same task. However, it was feature selection rather than semantic tagging that contributed most to the improved performance. CONCLUSION: This study has shown that for the classification of disease outbreak reports, a combination of bag-of-words, n-grams and semantic features, in conjunction with feature selection, increases classification accuracy at a statistically significant level compared to previous work in this domain.
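To make the n-gram features and chi-squared feature selection mentioned in the abstract above concrete, here is a minimal Python sketch. It is not the authors' implementation: whitespace tokenization, binary (presence/absence) features, the function names, and the toy documents are all assumptions, and the USAS semantic features are omitted.

```python
from typing import Dict, List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> List[str]:
    """All contiguous n-token sequences, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_features(text: str, max_n: int = 3) -> Set[str]:
    """Unigrams, bigrams and trigrams present in a lowercased text."""
    tokens = text.lower().split()  # naive tokenization (assumption)
    features: Set[str] = set()
    for n in range(1, max_n + 1):
        features.update(ngrams(tokens, n))
    return features


def chi_squared(a: int, b: int, c: int, d: int) -> float:
    """Chi-squared statistic for a 2x2 table.

    a: relevant docs with the feature, b: irrelevant docs with it,
    c: relevant docs without it,       d: irrelevant docs without it.
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def score_features(docs: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Chi-squared score of every n-gram against the relevance label."""
    featurized = [(extract_features(text), label) for text, label in docs]
    n_pos = sum(1 for _, label in featurized if label)
    n_neg = len(featurized) - n_pos
    vocabulary: Set[str] = set().union(*(f for f, _ in featurized))
    scores = {}
    for feature in vocabulary:
        a = sum(1 for f, label in featurized if label and feature in f)
        b = sum(1 for f, label in featurized if not label and feature in f)
        scores[feature] = chi_squared(a, b, n_pos - a, n_neg - b)
    return scores


# Toy labelled documents (made up): True marks an outbreak report.
toy_docs = [
    ("cholera outbreak reported", True),
    ("cholera outbreak confirmed", True),
    ("football match reported", False),
    ("football match cancelled", False),
]
toy_scores = score_features(toy_docs)
```

Feature selection then keeps the highest-scoring n-grams; in the toy data, "outbreak" separates the classes perfectly while "reported" appears equally in both and scores zero.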

Subjects

Disease Notification/methods , Disease Outbreaks/classification , Information Storage and Retrieval/methods , Natural Language Processing , Semantics , Humans , Internet , Computerized Medical Records Systems

Towards role-based filtering of disease outbreak reports.

Doan, Son; Kawazoe, Ai; Conway, Mike; Collier, Nigel.

J Biomed Inform ; 42(5): 773-80, 2009 Oct.

Article in English | MEDLINE | ID: mdl-19171201

ABSTRACT

This paper explores the role of named entities (NEs) in the classification of disease outbreak reports. In the annotation schema of BioCaster, a text mining system for public health protection, important concepts that reflect information about infectious diseases were conceptually analyzed with a formal ontological methodology and classified into types and roles. Types are specified as NE classes, and roles are integrated into NEs as attributes, for example whether a chemical is being used as a therapy for some infectious disease. We focus on the roles of NEs and explore different ways to extract, combine and use them as features in a text classifier. In addition, we investigate the combination of roles with semantic categories of disease-related nouns and verbs. Experimental results using naïve Bayes and Support Vector Machine (SVM) algorithms show that: (1) roles in combination with NEs improve performance in text classification, and (2) roles in combination with semantic categories of noun and verb features contribute substantially to the improvement of text classification. Both these results were statistically significant compared to the baseline "raw text" representation. We discuss in detail the effects of roles on each NE and on semantic categories of noun and verb features in terms of accuracy, precision/recall and F-score measures for the text classification task.

Subjects

Disease Outbreaks , Information Storage and Retrieval/methods , Medical Informatics/methods , Natural Language Processing , Algorithms , Artificial Intelligence , Bayes Theorem , Humans , Automated Pattern Recognition , Population Surveillance

BioCaster: detecting public health rumors with a Web-based text mining system.

Collier, Nigel; Doan, Son; Kawazoe, Ai; Goodwin, Reiko Matsuda; Conway, Mike; Tateno, Yoshio; Ngo, Quoc-Hung; Dien, Dinh; Kawtrakul, Asanee; Takeuchi, Koichi; Shigematsu, Mika; Taniguchi, Kiyosu.

Bioinformatics ; 24(24): 2940-1, 2008 Dec 15.

Article in English | MEDLINE | ID: mdl-18922806

ABSTRACT

SUMMARY: BioCaster is an ontology-based text mining system for detecting and tracking the distribution of infectious disease outbreaks from linguistic signals on the Web. The system continuously analyzes documents reported from over 1700 RSS feeds, classifies them for topical relevance and plots them onto a Google map using geocoded information. The background knowledge for bridging the gap between layman's terms and formal coding systems is contained in the freely available BioCaster ontology, which includes information in eight languages focused on the epidemiological role of pathogens as well as geographical locations with their latitudes/longitudes. The system consists of four main stages: topic classification, named entity recognition (NER), disease/location detection and event recognition. Higher order event analysis is used to detect more precisely specified warning signals, which can then be sent to registered users via email alerts. Evaluation of the system for topic recognition and entity identification is conducted on a gold standard corpus of annotated news articles. AVAILABILITY: The BioCaster map and ontology are freely available via a web portal at http://www.biocaster.org.
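The four stages listed in the summary above (topic classification, NER, disease/location detection, event recognition) can be pictured as a simple chain of functions. The Python sketch below is a toy illustration, not BioCaster's implementation: the keyword lexicons, function names, and output format are all hypothetical, the coordinates are approximate, and each real stage is replaced by a simple lookup.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-ins for the ontology: disease terms and geocoded places.
DISEASE_TERMS = {"cholera", "influenza"}
LOCATION_COORDS: Dict[str, Tuple[float, float]] = {"hanoi": (21.03, 105.85)}


def is_relevant(text: str) -> bool:
    """Stage 1, topic classification: here a crude keyword test."""
    lowered = text.lower()
    return any(term in lowered for term in DISEASE_TERMS)


def recognize_entities(text: str) -> List[Tuple[str, str]]:
    """Stage 2, NER: tag tokens found in the toy lexicons."""
    entities = []
    for raw in text.lower().split():
        token = raw.strip(".,;:!?")
        if token in DISEASE_TERMS:
            entities.append((token, "DISEASE"))
        elif token in LOCATION_COORDS:
            entities.append((token, "LOCATION"))
    return entities


def detect_disease_location(
    entities: List[Tuple[str, str]]
) -> Tuple[Optional[str], Optional[str]]:
    """Stage 3: pick the first disease and the first location mentioned."""
    disease = next((e for e, kind in entities if kind == "DISEASE"), None)
    location = next((e for e, kind in entities if kind == "LOCATION"), None)
    return disease, location


def recognize_event(text: str) -> Optional[dict]:
    """Stage 4: emit an event only when a disease and a place co-occur."""
    if not is_relevant(text):
        return None
    disease, location = detect_disease_location(recognize_entities(text))
    if disease is None or location is None:
        return None
    return {"disease": disease, "location": location,
            "coords": LOCATION_COORDS[location]}


event = recognize_event("Cholera cases reported in Hanoi.")
```

The coordinates attached to the event are what would allow it to be plotted on a map, as the summary describes for geocoded information.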

Subjects

Information Storage and Retrieval/methods , Population Surveillance , Software , Humans , Internet , Public Health

Structuring an event ontology for disease outbreak detection.

Kawazoe, Ai; Chanlekha, Hutchatai; Shigematsu, Mika; Collier, Nigel.

BMC Bioinformatics ; 9 Suppl 3: S8, 2008 Apr 11.

Article in English | MEDLINE | ID: mdl-18426553

ABSTRACT

BACKGROUND: This paper describes the design of an event ontology being developed for application in the machine understanding of infectious disease-related events reported in natural language text. This event ontology is designed to support timely detection of disease outbreaks and rapid judgment of their alerting status by 1) bridging a gap between layman's language used in disease outbreak reports and public health experts' deep knowledge, and 2) making multi-lingual information available. CONSTRUCTION AND CONTENT: This event ontology integrates a model of experts' knowledge for disease surveillance, and at the same time sets of linguistic expressions which denote disease-related events, and formal definitions of events. In this ontology, rather general event classes, which are suitable for application to language-oriented tasks such as recognition of event expressions, are placed on the upper-level, and more specific events of the experts' interest are in the lower level. Each class is related to other classes which represent participants of events, and linked with multi-lingual synonym sets and axioms. CONCLUSIONS: We consider that the design of the event ontology and the methodology introduced in this paper are applicable to other domains which require integration of natural language information and machine support for experts to assess them. The first version of the ontology, with about 40 concepts, will be available in March 2008.

Subjects

Algorithms , Artificial Intelligence , Disease Outbreaks/prevention & control , Natural Language Processing , Automated Pattern Recognition/methods , Population Surveillance/methods , Controlled Vocabulary

A multilingual ontology for infectious disease surveillance: rationale, design and challenges.

Collier, Nigel; Kawazoe, Ai; Jin, Lihua; Shigematsu, Mika; Dien, Dinh; Barrero, Roberto A; Takeuchi, Koichi; Kawtrakul, Asanee.

Lang Resour Eval ; 40(3): 405, 2006.

Article in English | MEDLINE | ID: mdl-32214930

ABSTRACT

A lack of surveillance system infrastructure in the Asia-Pacific region is seen as hindering the global control of rapidly spreading infectious diseases such as the recent avian H5N1 epidemic. As part of improving surveillance in the region, the BioCaster project aims to develop a system based on text mining for automatically monitoring Internet news and other online sources in several regional languages. At the heart of the system is an application ontology which serves the dual purpose of enabling advanced searches on the mined facts and of allowing the system to make intelligent inferences for assessing the priority of events. However, it became clear early on in the project that existing classification schemes did not have the necessary language coverage or semantic specificity for our needs. In this article we present an overview of our needs and explore in detail the rationale and methods for developing a new conceptual structure and multilingual terminological resource that focusses on priority pathogens and the diseases they cause. The ontology is made freely available as an online database and downloadable OWL file.
