Results 1 - 12 of 12
1.
Article in English | MEDLINE | ID: mdl-29271009

ABSTRACT

OBJECTIVES: As electronic mental health records become more widely available, several approaches have been suggested to automatically extract information from free-text narrative aiming to support epidemiological research and clinical decision-making. In this paper, we explore extraction of explicit mentions of symptom severity from initial psychiatric evaluation records. We use the data provided by the 2016 CEGS N-GRID NLP shared task Track 2, which contains 541 records manually annotated for symptom severity according to the Research Domain Criteria. METHODS: We designed and implemented 3 automatic methods: a knowledge-driven approach relying on local lexicalized rules based on common syntactic patterns in text suggesting positive valence symptoms; a machine learning method using a neural network; and a hybrid approach combining the first 2 methods with a neural network. RESULTS: The results on an unseen evaluation set of 216 psychiatric evaluation records showed a performance of 80.1% for the rule-based method, 73.3% for the machine-learning approach, and 72.0% for the hybrid one. CONCLUSIONS: Although more work is needed to improve the accuracy, the results are encouraging and indicate that automated text mining methods can be used to classify mental health symptom severity from free text psychiatric notes to support epidemiological and clinical research.


Subjects
Data Mining/methods , Electronic Health Records , Machine Learning , Mental Disorders/physiopathology , Severity of Illness Index , Adult , Humans , Mental Disorders/diagnosis , Neural Networks, Computer
2.
J Biomed Inform ; 75S: S28-S33, 2017 Nov.
Article in English | MEDLINE | ID: mdl-28602908

ABSTRACT

De-identification of clinical narratives is one of the main obstacles to making healthcare free text available for research. In this paper we describe our experience in expanding and tailoring two existing tools as part of the 2016 CEGS N-GRID Shared Tasks Track 1, which evaluated de-identification methods on a set of psychiatric evaluation notes for up to 25 different types of Protected Health Information (PHI). The methods we used rely on machine learning on either a large or small feature space, with additional strategies, including two-pass tagging and multi-class models, which both proved to be beneficial. The results show that the integration of the proposed methods can identify Health Insurance Portability and Accountability Act (HIPAA) defined PHIs with overall F1-scores of ∼90% and above. Yet, some classes (Profession, Organization) proved again to be challenging given the variability of expressions used to reference given information.


Subjects
Algorithms , Confidentiality , Mental Disorders/psychology , Health Insurance Portability and Accountability Act , Humans , Machine Learning , United States
3.
Pediatr Radiol ; 46(1): 73-81, 2016 Jan.
Article in English | MEDLINE | ID: mdl-26403618

ABSTRACT

BACKGROUND: Birth-related acute profound hypoxic-ischaemic brain injury has specific patterns of damage including the paracentral lobules. OBJECTIVE: To test the hypothesis that there is anatomically coherent regional volume loss of the corpus callosum as a result of this hemispheric abnormality. MATERIALS AND METHODS: Study subjects included 13 children with proven acute profound hypoxic-ischaemic brain injury and 13 children with developmental delay but no brain abnormalities. A computerised system divided the corpus callosum into 100 segments, measuring each width. Principal component analysis grouped the widths into contiguous anatomical regions. We conducted analysis of variance of corpus callosum widths as well as support vector machine stratification into patient groups. RESULTS: There was statistically significant narrowing of the mid-posterior body and genu of the corpus callosum in children with hypoxic-ischaemic brain injury. Support vector machine analysis yielded over 95% accuracy in patient group stratification using the corpus callosum centile widths. CONCLUSION: Focal volume loss is seen in the corpus callosum of children with hypoxic-ischaemic brain injury secondary to loss of commissural fibres arising in the paracentral lobules. Support vector machine stratification into the hypoxic-ischaemic brain injury group or the control group on the basis of corpus callosum width is highly accurate and points towards rapid clinical translation of this technique as a potential biomarker of hypoxic-ischaemic brain injury.


Subjects
Corpus Callosum/injuries , Corpus Callosum/pathology , Hypoxia-Ischemia, Brain/pathology , Magnetic Resonance Imaging/methods , Adolescent , Child , Child, Preschool , Female , Humans , Infant , Male , Reproducibility of Results , Sensitivity and Specificity
4.
J Biomed Inform ; 58 Suppl: S183-S188, 2015 Dec.
Article in English | MEDLINE | ID: mdl-26133479

ABSTRACT

Heart disease is the leading cause of death globally and a significant part of the human population lives with it. A number of risk factors have been recognized as contributing to the disease, including obesity, coronary artery disease (CAD), hypertension, hyperlipidemia, diabetes, smoking, and family history of premature CAD. This paper describes and evaluates a methodology to extract mentions of such risk factors from diabetic clinical notes, which was a task of the i2b2/UTHealth 2014 Challenge in Natural Language Processing for Clinical Data. The methodology is knowledge-driven and the system implements local lexicalized rules (based on syntactical patterns observed in notes) combined with manually constructed dictionaries that characterize the domain. A part of the task was also to detect the time interval in which the risk factors were present in a patient. The system was applied to an evaluation set of 514 unseen notes and achieved a micro-average F-score of 88% (with 86% precision and 90% recall). While the identification of CAD family history, medication and some of the related disease factors (e.g. hypertension, diabetes, hyperlipidemia) showed quite good results, the identification of CAD-specific indicators proved to be more challenging (F-score of 74%). Overall, the results are encouraging and suggest that automated text mining methods can be used to process clinical notes to identify risk factors and monitor progression of heart disease on a large scale, providing necessary data for clinical and epidemiological studies.


Subjects
Cardiovascular Diseases/epidemiology , Data Mining/methods , Diabetes Complications/epidemiology , Electronic Health Records/organization & administration , Narration , Natural Language Processing , Aged , Cardiovascular Diseases/diagnosis , Cohort Studies , Comorbidity , Computer Security , Confidentiality , Diabetes Complications/diagnosis , Female , Humans , Incidence , Longitudinal Studies , Male , Middle Aged , Pattern Recognition, Automated/methods , Risk Assessment/methods , Semantics , United Kingdom/epidemiology , Vocabulary, Controlled
5.
J Biomed Inform ; 58 Suppl: S53-S59, 2015 Dec.
Article in English | MEDLINE | ID: mdl-26210359

ABSTRACT

A recent promise to access unstructured clinical data from electronic health records on a large scale has revitalized the interest in automated de-identification of clinical notes, which includes the identification of mentions of Protected Health Information (PHI). We describe the methods developed and evaluated as part of the i2b2/UTHealth 2014 challenge to identify PHI defined by 25 entity types in longitudinal clinical narratives. Our approach combines knowledge-driven (dictionaries and rules) and data-driven (machine learning) methods with a large range of features to address de-identification of specific named entities. In addition, we have devised a two-pass recognition approach that creates a patient-specific run-time dictionary from the PHI entities identified in the first step with high confidence, which is then used in the second pass to identify mentions that lack specific clues. The proposed method achieved overall micro F1-measures of 91% on strict and 95% on token-level evaluation on the test dataset (514 narratives). Whilst most PHI entities can be reliably identified, mentions of Organizations and Professions were particularly challenging. Still, the overall results suggest that automated text mining methods can be used to reliably process clinical notes to identify personal information and thus provide a crucial step in large-scale de-identification of unstructured data for further clinical and epidemiological studies.


Subjects
Computer Security , Confidentiality , Electronic Health Records/organization & administration , Narration , Natural Language Processing , Pattern Recognition, Automated/methods , Cohort Studies , Computer Simulation , Data Mining/methods , Machine Learning , Models, Statistical , United Kingdom , Vocabulary, Controlled
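
The two-pass recognition idea in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' system: the title-based regular expression standing in for the high-confidence first pass, the function names `first_pass` and `second_pass`, and the sample note are all assumptions made for illustration.

```python
import re


def first_pass(text):
    """Hypothetical high-confidence tagger: surnames introduced by an explicit title."""
    pattern = r"\b(?:Dr|Mr|Mrs|Ms)\.\s+([A-Z][a-z]+)"
    return {m.group(1) for m in re.finditer(pattern, text)}


def second_pass(text, dictionary):
    """Tag every mention of a dictionary entry, including mentions with no local clue."""
    spans = []
    for name in sorted(dictionary):
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            spans.append((m.start(), m.end(), name))
    return sorted(spans)


# Invented example note; the second "Smith" has no title to signal that it is a name.
note = "Dr. Smith reviewed the chart. Smith will see the patient again in May."
patient_dictionary = first_pass(note)              # run-time dictionary built per patient
mentions = second_pass(note, patient_dictionary)   # includes the clue-less mention
```

Building the dictionary per patient rather than globally keeps a surname found in one record from being tagged in unrelated records.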
6.
Int J Med Inform ; 83(9): 605-23, 2014 Sep.
Article in English | MEDLINE | ID: mdl-25008281

ABSTRACT

PURPOSE: This paper reviews the research literature on text mining (TM) with the aim to find out (1) which cancer domains have been the subject of TM efforts, (2) which knowledge resources can support TM of cancer-related information and (3) to what extent systems that rely on knowledge and computational methods can convert text data into useful clinical information. These questions were used to determine the current state of the art in this particular strand of TM and suggest future directions in TM development to support cancer research. METHODS: A review of the research on TM of cancer-related information was carried out. A literature search was conducted on the Medline database as well as IEEE Xplore and ACM digital libraries to address the interdisciplinary nature of such research. The search results were supplemented with the literature identified through Google Scholar. RESULTS: A range of studies have proven the feasibility of TM for extracting structured information from clinical narratives such as those found in pathology or radiology reports. In this article, we provide a critical overview of the current state of the art for TM related to cancer. The review highlighted a strong bias towards symbolic methods, e.g. named entity recognition (NER) based on dictionary lookup and information extraction (IE) relying on pattern matching. The F-measure of NER ranges between 80% and 90%, while that of IE for simple tasks is in the high 90s. To further improve the performance, TM approaches need to deal effectively with idiosyncrasies of the clinical sublanguage such as non-standard abbreviations as well as a high degree of spelling and grammatical errors. This requires a shift from rule-based methods to machine learning following the success of similar trends in biological applications of TM. Machine learning approaches require large training datasets, but clinical narratives are not readily available for TM research due to privacy and confidentiality concerns. This issue remains the main bottleneck for progress in this area. In addition, there is a need for a comprehensive cancer ontology that would enable semantic representation of textual information found in narrative reports.


Subjects
Computational Biology/methods , Data Mining/trends , Medical Oncology , Neoplasms , Humans , Information Storage and Retrieval
7.
J Am Med Inform Assoc ; 20(5): 859-66, 2013.
Article in English | MEDLINE | ID: mdl-23605114

ABSTRACT

OBJECTIVE: Identification of clinical events (eg, problems, tests, treatments) and associated temporal expressions (eg, dates and times) are key tasks in extracting and managing data from electronic health records. As part of the i2b2 2012 Natural Language Processing for Clinical Data challenge, we developed and evaluated a system to automatically extract temporal expressions and events from clinical narratives. The extracted temporal expressions were additionally normalized by assigning type, value, and modifier. MATERIALS AND METHODS: The system combines rule-based and machine learning approaches that rely on morphological, lexical, syntactic, semantic, and domain-specific features. Rule-based components were designed to handle the recognition and normalization of temporal expressions, while conditional random fields models were trained for event and temporal recognition. RESULTS: The system achieved micro F scores of 90% for the extraction of temporal expressions and 87% for clinical event extraction. The normalization component for temporal expressions achieved accuracies of 84.73% (expression's type), 70.44% (value), and 82.75% (modifier). DISCUSSION: Compared to the initial agreement between human annotators (87-89%), the system provided comparable performance for both event and temporal expression mining. While (lenient) identification of such mentions is achievable, finding the exact boundaries proved challenging. CONCLUSIONS: The system provides a state-of-the-art method that can be used to support automated identification of mentions of clinical events and temporal expressions in narratives either to support the manual review process or as a part of a large-scale processing of electronic health databases.


Subjects
Artificial Intelligence , Electronic Health Records , Information Storage and Retrieval/methods , Humans , Natural Language Processing , Time , Translational Research, Biomedical
8.
Biomed Inform Insights ; 5(Suppl. 1): 115-24, 2012.
Article in English | MEDLINE | ID: mdl-22879767

ABSTRACT

We describe and evaluate an automated approach used as part of the i2b2 2011 challenge to identify and categorise statements in suicide notes into one of 15 topics, including Love, Guilt, Thankfulness, Hopelessness and Instructions. The approach combines a set of lexico-syntactic rules with a set of models derived by machine learning from a training dataset. The machine learning models rely on named entities, lexical, lexico-semantic and presentation features, as well as the rules that are applicable to a given statement. On a testing set of 300 suicide notes, the approach showed the overall best micro F-measure of up to 53.36%. The best precision achieved was 67.17% when only rules are used, whereas best recall of 50.57% was with integrated rules and machine learning. While some topics (eg, Sorrow, Anger, Blame) prove challenging, the performance for relatively frequent (eg, Love) and well-scoped categories (eg, Thankfulness) was comparatively higher (precision between 68% and 79%), suggesting that automated text mining approaches can be effective in topic categorisation of suicide notes.

9.
J Am Med Inform Assoc ; 17(5): 532-5, 2010.
Article in English | MEDLINE | ID: mdl-20819858

ABSTRACT

OBJECTIVE: This study presents a system developed for the 2009 i2b2 Challenge in Natural Language Processing for Clinical Data, whose aim was to automatically extract certain information about medications used by a patient from his/her medical report. The aim was to extract the following information for each medication: name, dosage, mode/route, frequency, duration and reason. DESIGN: The system implements a rule-based methodology, which exploits typical morphological, lexical, syntactic and semantic features of the targeted information. These features were acquired from the training dataset and public resources such as the UMLS and relevant web pages. Information extracted by pattern matching was combined together using context-sensitive heuristic rules. MEASUREMENTS: The system was applied to a set of 547 previously unseen discharge summaries, and the extracted information was evaluated against a manually prepared gold standard consisting of 251 documents. The overall ranking of the participating teams was obtained using the micro-averaged F-measure as the primary evaluation metric. RESULTS: The implemented method achieved the micro-averaged F-measure of 81% (with 86% precision and 77% recall), which ranked this system third in the challenge. The significance tests revealed the system's performance to be not significantly different from that of the second ranked system. Relative to other systems, this system achieved the best F-measure for the extraction of duration (53%) and reason (46%). CONCLUSION: Based on the F-measure, the performance achieved (81%) was in line with the initial agreement between human annotators (82%), indicating that such a system may greatly facilitate the process of extracting relevant information from medical records by providing a solid basis for a manual review process.


Subjects
Electronic Health Records , Information Storage and Retrieval/methods , Natural Language Processing , Pharmaceutical Preparations , Artificial Intelligence , Humans , Linguistics , Semantics
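
As a quick arithmetic check, the F-measure reported in the abstract above is consistent with its precision and recall, assuming the usual balanced F-measure (the harmonic mean of the two). The helper name `f_measure` is illustrative only.

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract (86% and 77%).
score = f_measure(0.86, 0.77)
print(f"{score:.2%}")  # prints 81.25%, which rounds to the reported 81%
```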
10.
J Am Med Inform Assoc ; 16(4): 596-600, 2009.
Article in English | MEDLINE | ID: mdl-19390098

ABSTRACT

OBJECTIVE: The authors present a system developed for the Challenge in Natural Language Processing for Clinical Data (the i2b2 obesity challenge), whose aim was to automatically identify the status of obesity and 15 related co-morbidities in patients using their clinical discharge summaries. The challenge consisted of two tasks, textual and intuitive. The textual task was to identify explicit references to the diseases, whereas the intuitive task focused on the prediction of the disease status when the evidence was not explicitly asserted. DESIGN: The authors assembled a set of resources to lexically and semantically profile the diseases and their associated symptoms, treatments, etc. These features were explored in a hybrid text mining approach, which combined dictionary look-up, rule-based, and machine-learning methods. MEASUREMENTS: The methods were applied on a set of 507 previously unseen discharge summaries, and the predictions were evaluated against a manually prepared gold standard. The overall ranking of the participating teams was primarily based on the macro-averaged F-measure. RESULTS: The implemented method achieved the macro-averaged F-measure of 81% for the textual task (which was the highest achieved in the challenge) and 63% for the intuitive task (ranked 7th out of 28 teams; the highest was 66%). The micro-averaged F-measure showed an average accuracy of 97% for textual and 96% for intuitive annotations. CONCLUSIONS: The performance achieved was in line with the agreement between human annotators, indicating the potential of text mining for accurate and efficient prediction of disease statuses from clinical discharge summaries.


Subjects
Information Storage and Retrieval/methods , Medical Records Systems, Computerized , Natural Language Processing , Obesity , Patient Discharge , Comorbidity , Humans , Software
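
The gap between the macro- and micro-averaged figures in the abstract above is typical when classes differ in frequency. The sketch below uses invented counts (not data from the study) to show how a rare, poorly predicted class lowers the macro average while barely moving the micro average.

```python
def f1(tp, fp, fn):
    """F-measure computed directly from true-positive, false-positive and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical (tp, fp, fn) counts for one frequent and one rare class.
counts = {"frequent": (90, 5, 5), "rare": (2, 3, 5)}

# Macro average: compute F per class, then take the unweighted mean.
macro = sum(f1(*c) for c in counts.values()) / len(counts)

# Micro average: pool the counts over all classes, then compute one F.
tp, fp, fn = (sum(col) for col in zip(*counts.values()))
micro = f1(tp, fp, fn)

print(f"macro={macro:.2f} micro={micro:.2f}")  # prints macro=0.64 micro=0.91
```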
11.
BMC Bioinformatics ; 9 Suppl 3: S11, 2008 Apr 11.
Article in English | MEDLINE | ID: mdl-18426546

ABSTRACT

BACKGROUND: Availability of information about transcription factors (TFs) is crucial for genome biology, as TFs play a central role in the regulation of gene expression. While manual literature curation is expensive and labour intensive, the development of semi-automated text mining support is hindered by unavailability of training data. There have been no studies on how existing data sources (e.g. TF-related data from the MeSH thesaurus and GO ontology) or potentially noisy example data (e.g. protein-protein interaction, PPI) could be used to provide training data for identification of TF-contexts in literature. RESULTS: In this paper we describe a text-classification system designed to automatically recognise contexts related to transcription factors in literature. A learning model is based on a set of biological features (e.g. protein and gene names, interaction words, other biological terms) that are deemed relevant for the task. We have exploited background knowledge from existing biological resources (MeSH and GO) to engineer such features. Weak and noisy training datasets have been collected from descriptions of TF-related concepts in MeSH and GO, PPI data and data representing non-protein-function descriptions. Three machine-learning methods are investigated, along with a vote-based merging of individual approaches and/or different training datasets. The system achieved highly encouraging results, with most classifiers achieving an F-measure above 90%. CONCLUSIONS: The experimental results have shown that the proposed model can be used for identification of TF-related contexts (i.e. sentences) with high accuracy, with a significantly reduced set of features when compared to the traditional bag-of-words approach. The results of considering existing PPI data suggest that the similarity between TF and PPI contexts is not as high as we had expected. We have also shown that existing knowledge sources are useful both for feature engineering and for obtaining noisy positive training data.


Subjects
Algorithms , Artificial Intelligence , Natural Language Processing , Pattern Recognition, Automated/methods , Terminology as Topic , Transcription Factors/classification , Transcription Factors/metabolism , Vocabulary, Controlled
12.
Bioinformation ; 2(5): 197-206, 2007 Dec 30.
Article in English | MEDLINE | ID: mdl-18305829

ABSTRACT

Linking gene and protein names mentioned in the literature to unique identifiers in referent genomic databases is an essential step in accessing and integrating knowledge in the biomedical domain. However, it remains a challenging task due to lexical and terminological variation, and ambiguity of gene name mentions in documents. We present a generic and effective rule-based approach to link gene mentions in the literature to referent genomic databases, where pre-processing of both gene synonyms in the databases and gene mentions in text are first applied. The mapping method employs a cascaded approach, which combines exact, exact-like and token-based approximate matching by using flexible representations of a gene synonym dictionary and gene mentions generated during the pre-processing phase. We also consider multi-gene name mentions and permutation of components in gene names. A systematic evaluation of the suggested methods has identified steps that are beneficial for improving either precision or recall in gene name identification. The results of the experiments on the BioCreAtIvE2 data sets (identification of human gene names) demonstrated that our methods achieved highly encouraging results with F-measure of up to 81.20%.
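
The cascade described in the abstract above (exact, then exact-like, then token-based matching) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the published method: the normalisation rules, the use of token-set equality in place of approximate matching, the function names, and the placeholder identifiers in `synonyms` are all hypothetical.

```python
import re


def normalise(name):
    """Exact-like form: lower-case with punctuation and whitespace removed."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def tokens(name):
    """Order-insensitive bag of alphabetic and numeric components."""
    return frozenset(re.findall(r"[a-z]+|[0-9]+", name.lower()))


def link(mention, synonyms):
    """Try progressively looser matches; `synonyms` maps synonym strings to identifiers."""
    # Stage 1: exact string match.
    if mention in synonyms:
        return synonyms[mention], "exact"
    # Stage 2: exact-like match on the normalised form.
    norm = normalise(mention)
    for syn, gene_id in synonyms.items():
        if normalise(syn) == norm:
            return gene_id, "exact-like"
    # Stage 3: token-based match, which tolerates permuted components.
    toks = tokens(mention)
    for syn, gene_id in synonyms.items():
        if tokens(syn) == toks:
            return gene_id, "token"
    return None, "unlinked"


# Placeholder dictionary; the identifiers are invented for the example.
synonyms = {"BRCA1": "ID:0001", "tumor protein p53": "ID:0002"}
```

Ordering the stages from strictest to loosest means a looser stage is consulted only when every stricter one has failed, which is one way to trade precision against recall.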
