1.
BMC Med Inform Decis Mak ; 24(1): 162, 2024 Jun 12.
Article in English | MEDLINE | ID: mdl-38915012

ABSTRACT

Many state-of-the-art results in natural language processing (NLP) rely on large pre-trained language models (PLMs). These models consist of large amounts of parameters that are tuned using vast amounts of training data. These factors cause the models to memorize parts of their training data, making them vulnerable to various privacy attacks. This is cause for concern, especially when these models are applied in the clinical domain, where data are very sensitive. Training data pseudonymization is a privacy-preserving technique that aims to mitigate these problems. This technique automatically identifies and replaces sensitive entities with realistic but non-sensitive surrogates. Pseudonymization has yielded promising results in previous studies. However, no previous study has applied pseudonymization to both the pre-training data of PLMs and the fine-tuning data used to solve clinical NLP tasks. This study evaluates the effects on the predictive performance of end-to-end pseudonymization of Swedish clinical BERT models fine-tuned for five clinical NLP tasks. A large number of statistical tests are performed, revealing minimal harm to performance when using pseudonymized fine-tuning data. The results also find no deterioration from end-to-end pseudonymization of pre-training and fine-tuning data. These results demonstrate that pseudonymizing training data to reduce privacy risks can be done without harming data utility for training PLMs.


Subjects
Natural Language Processing, Humans, Privacy, Sweden, Anonyms and Pseudonyms, Computer Security/standards, Confidentiality/standards, Electronic Health Records/standards
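
The abstract of entry 1 describes pseudonymization as identifying sensitive entities and replacing them with realistic but non-sensitive surrogates. A minimal sketch of the replacement step follows. It is not the study's pipeline: the surrogate lists, the label names, and the function `pseudonymize` are hypothetical, and the entity spans are assumed to come from some upstream named-entity recognizer.

```python
import random

# Hypothetical surrogate pools; a real system would use far larger lists.
SURROGATES = {
    "FIRST_NAME": ["Anna", "Erik", "Maria", "Johan"],
    "LOCATION": ["Uppsala", "Lund", "Umeå"],
}


def pseudonymize(text, entities, seed=0):
    """Replace each (start, end, label) span with a surrogate of the same label.

    `entities` stands in for the output of a named-entity recognizer.
    """
    rng = random.Random(seed)
    pieces = []
    cursor = 0
    for start, end, label in sorted(entities):
        pieces.append(text[cursor:start])
        original = text[start:end]
        # Never re-insert the original value as its own surrogate.
        candidates = [s for s in SURROGATES[label] if s != original]
        pieces.append(rng.choice(candidates))
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


note = "Karin was admitted in Solna."
spans = [(0, 5, "FIRST_NAME"), (22, 27, "LOCATION")]
result = pseudonymize(note, spans)
print(result)
```

Replacing spans from left to right while tracking a cursor keeps the character offsets valid even when a surrogate is longer or shorter than the text it replaces.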
2.
JMIR Res Protoc ; 13: e54593, 2024 Mar 12.
Article in English | MEDLINE | ID: mdl-38470476

ABSTRACT

BACKGROUND: Computer-assisted clinical coding (CAC) tools are designed to help clinical coders assign standardized codes, such as the ICD-10 (International Statistical Classification of Diseases, Tenth Revision), to clinical texts, such as discharge summaries. Maintaining the integrity of these standardized codes is important both for the functioning of health systems and for ensuring data used for secondary purposes are of high quality. Clinical coding is an error-prone, cumbersome task, and the complexity of modern classification systems such as the ICD-11 (International Classification of Diseases, Eleventh Revision) presents significant barriers to implementation. To date, there have only been a few user studies; therefore, our understanding is still limited regarding the role CAC systems can play in reducing the burden of coding and improving the overall quality of coding. OBJECTIVE: The objective of the user study is to generate both qualitative and quantitative data for measuring the usefulness of a CAC system, Easy-ICD, that was developed for recommending ICD-10 codes. Specifically, our goal is to assess whether our tool can reduce the burden on clinical coders and also improve coding quality. METHODS: The user study is based on a crossover randomized controlled trial study design, where we measure the performance of clinical coders when they use our CAC tool versus when they do not. Performance is measured by the time it takes them to assign codes to both simple and complex clinical texts as well as the coding quality, that is, the accuracy of code assignment. RESULTS: We expect the study to provide us with a measurement of the effectiveness of the CAC system compared to manual coding processes, both in terms of time use and coding quality. Positive outcomes from this study will imply that CAC tools hold the potential to reduce the burden on health care staff and will have major implications for the adoption of artificial intelligence-based CAC innovations to improve coding practice. Results are expected to be published in summer 2024. CONCLUSIONS: The planned user study promises a greater understanding of the impact CAC systems might have on clinical coding in real-life settings, especially with regard to coding time and quality. Further, the study may add new insights on how to meaningfully exploit current clinical text mining capabilities, with a view to reducing the burden on clinical coders, thus lowering the barriers and paving a more sustainable path to the adoption of modern coding systems, such as the new ICD-11. TRIAL REGISTRATION: clinicaltrials.gov NCT06286865; https://clinicaltrials.gov/study/NCT06286865. INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID): DERR1-10.2196/54593.

3.
Sci Rep ; 13(1): 11760, 2023 07 20.
Article in English | MEDLINE | ID: mdl-37474597

ABSTRACT

Sepsis is a leading cause of mortality and early identification improves survival. With the increasing digitalization of health care data, automated sepsis prediction models hold promise to aid in prompt recognition. Most previous studies have focused on the intensive care unit (ICU) setting. Yet only a small proportion of sepsis develops in the ICU and there is an apparent clinical benefit in identifying patients earlier in the disease trajectory. In this cohort of 82,852 hospital admissions and 8038 sepsis episodes classified according to the Sepsis-3 criteria, we demonstrate that a machine-learned score can predict sepsis onset within 48 h using sparse routine electronic health record data outside the ICU. Our score was based on a causal probabilistic network model, SepsisFinder, which has similarities with clinical reasoning. A prediction was generated hourly on all admissions, provided that a new variable had been registered. Compared to the National Early Warning Score (NEWS2), which is an established method to identify sepsis, the SepsisFinder triggered earlier and had a higher area under receiver operating characteristic curve (AUROC) (0.950 vs. 0.872), as well as area under precision-recall curve (APR) (0.189 vs. 0.149). A machine learning comparator based on a gradient-boosting decision tree model had similar AUROC (0.949) and higher APR (0.239) than SepsisFinder but triggered later than both NEWS2 and SepsisFinder. The precision of SepsisFinder increased if screening was restricted to the earlier admission period and in episodes with bloodstream infection. Furthermore, the SepsisFinder signaled a median of 5.5 h prior to antibiotic administration. Identifying a high-risk population with this method could be used to tailor clinical interventions and improve patient care.


Subjects
Electronic Health Records, Sepsis, Humans, Retrospective Studies, Sepsis/diagnosis, Sepsis/epidemiology, Algorithms, Hospitalization, ROC Curve, Intensive Care Units, Hospital Mortality
4.
AMIA Annu Symp Proc ; 2023: 465-473, 2023.
Article in English | MEDLINE | ID: mdl-38222373

ABSTRACT

With the recent advances in natural language processing and deep learning, the development of tools that can assist medical coders in ICD-10 diagnosis coding and increase their efficiency in coding discharge summaries is significantly more viable than before. To that end, one important component in the development of these models is the datasets used to train them. In this study, such datasets are presented, and it is shown that one of them can be used to develop a BERT-based language model that can consistently perform well in assigning ICD-10 codes to discharge summaries written in Swedish. Most importantly, it can be used in a coding support setup where a tool can recommend potential codes to the coders. This reduces the range of potential codes to consider and, in turn, reduces the workload of the coder. Moreover, the de-identified and pseudonymised dataset is open to use for academic users.


Subjects
International Classification of Diseases, Patient Discharge, Humans, Natural Language Processing, Clinical Coding
5.
AMIA Annu Symp Proc ; 2023: 456-464, 2023.
Article in English | MEDLINE | ID: mdl-38222432

ABSTRACT

The lack of relevant annotated datasets represents one key limitation in the application of Natural Language Processing techniques in a broad number of tasks, among them Protected Health Information (PHI) identification in Norwegian clinical text. In this work, the possibility of exploiting resources from Swedish, a language very closely related to Norwegian, is explored. The Swedish dataset is annotated with PHI information. Different processing and text augmentation techniques are evaluated, along with their impact on the final performance of the model. The augmentation techniques, such as injection and generation of both Norwegian and Scandinavian Named Entities into the Swedish training corpus, were shown to increase the performance in the de-identification task for both Danish and Norwegian text. This trend was also confirmed by the evaluation of model performance on a sample Norwegian gastro surgical clinical text.


Subjects
Electronic Health Records, Language, Humans, Sweden, Natural Language Processing, Denmark
6.
Article in English | MEDLINE | ID: mdl-36310782

ABSTRACT

We developed and validated a set of fully automated surveillance algorithms for healthcare-onset Clostridioides difficile infection (CDI) using electronic health records. In a validation data set of 750 manually annotated admissions, the algorithm based on International Classification of Diseases, Tenth Revision (ICD-10) code A04.7 had insufficient sensitivity. Algorithms based on microbiological test results with or without addition of symptoms performed well.

7.
J Biomed Inform ; 130: 104050, 2022 06.
Article in English | MEDLINE | ID: mdl-35346854

ABSTRACT

Multi-label classification according to the International Classification of Diseases (ICD) is an Extreme Multi-label Classification task aiming to categorise health records according to a set of relevant ICD codes. We implemented PlaBERT, a new multi-label text classification head with per-label attention, on top of a BERT model. The model assessment is conducted on Electronic Health Records, conveying Discharge Summaries in three languages - English, Spanish, and Swedish. The study focuses on 157 diagnostic codes from the ICD. We additionally measure the labelling noise to estimate the consistency of the gold standard. Our specialised attention mechanism computes attention weights for each input token and label pair, obtaining the specific relevance of every word concerning each ICD code. The PlaBERT model outputs the computed attention importance for each token and label, allowing for visualisation. Our best results are 40.65, 38.36, and 41.13 F1-Score points on the English, Spanish and Swedish datasets, respectively, for the 157 gastrointestinal codes. Besides, Precision is the metric that most significantly improves owing to the attention mechanism of PlaBERT, with an increase of 44.63, 40.93, and 12.92 points, respectively, for the Spanish, Swedish and English datasets.


Subjects
International Classification of Diseases, Language, Electronic Health Records, Humans, Natural Language Processing, Patient Discharge, Sweden
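
The abstract of entry 7 describes a classification head that computes an attention weight for each input token and label pair. The sketch below shows one generic way such per-label attention can be arranged, in plain Python. It is not the PlaBERT implementation: the function `per_label_attention`, the use of one query vector per label, and the toy vectors are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def per_label_attention(token_vecs, label_queries, label_weights, label_biases):
    """Return (probabilities, attention) with one entry per label.

    token_vecs:    one vector per input token (e.g. encoder outputs)
    label_queries: one query vector per label, used to score each token
    label_weights, label_biases: per-label output layer (assumed form)
    """
    dim = len(token_vecs[0])
    probs, attention = [], []
    for query, weight, bias in zip(label_queries, label_weights, label_biases):
        # Attention weights over tokens for this specific label.
        alphas = softmax([dot(query, h) for h in token_vecs])
        # Label-specific document vector: attention-weighted sum of tokens.
        doc = [sum(a * h[i] for a, h in zip(alphas, token_vecs)) for i in range(dim)]
        logit = dot(weight, doc) + bias
        probs.append(1.0 / (1.0 + math.exp(-logit)))  # independent sigmoid per label
        attention.append(alphas)
    return probs, attention


tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # three toy token vectors
queries = [[2.0, 0.0], [0.0, 2.0]]             # two toy labels
weights = [[1.0, 0.0], [0.0, 1.0]]
biases = [0.0, 0.0]
probs, attention = per_label_attention(tokens, queries, weights, biases)
```

Because each label keeps its own attention row, `attention[k]` can be inspected or plotted to see which tokens contributed most to label `k`, which is the kind of visualisation the abstract mentions.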
8.
Crit Care Med ; 50(3): e272-e283, 2022 03 01.
Article in English | MEDLINE | ID: mdl-34406170

ABSTRACT

OBJECTIVES: Sequential Organ Failure Assessment score is the basis of the Sepsis-3 criteria and requires arterial blood gas analysis to assess respiratory function. Peripheral oxygen saturation is a noninvasive alternative but is included in neither Sequential Organ Failure Assessment score nor Sepsis-3. We aimed to assess the association between worst peripheral oxygen saturation during onset of suspected infection and mortality. DESIGN: Cohort study of hospital admissions from a main cohort and emergency department visits from four external validation cohorts between 2011 and 2018. Data were collected from electronic health records and prospectively by study investigators. SETTING: Eight academic and community hospitals in Sweden and Canada. PATIENTS: Adult patients with suspected infection episodes. INTERVENTIONS: None. MEASUREMENTS AND MAIN RESULTS: The main cohort included 19,396 episodes (median age, 67.0 [53.0-77.0]; 9,007 [46.4%] women; 1,044 [5.4%] died). The validation cohorts included 10,586 episodes (range of median age, 61.0-76.0; women 42.1-50.2%; mortality 2.3-13.3%). Peripheral oxygen saturation levels 96-95% were not significantly associated with increased mortality in the main or pooled validation cohorts. At peripheral oxygen saturation 94%, the adjusted odds ratio of death was 1.56 (95% CI, 1.10-2.23) in the main cohort and 1.36 (95% CI, 1.00-1.85) in the pooled validation cohorts and increased gradually below this level. Respiratory assessment using peripheral oxygen saturation 94-91% and less than 91% to generate 1 and 2 Sequential Organ Failure Assessment points, respectively, improved the discrimination of the Sequential Organ Failure Assessment score from an area under the receiver operating characteristic curve of 0.75 (95% CI, 0.74-0.77) to 0.78 (95% CI, 0.77-0.80; p < 0.001). 
Peripheral oxygen saturation/Fio2 ratio had slightly better predictive performance compared with peripheral oxygen saturation alone, but the clinical impact was minor. CONCLUSIONS: These findings provide evidence for assessing respiratory function with peripheral oxygen saturation in the Sequential Organ Failure Assessment score and the Sepsis-3 criteria. Our data support using peripheral oxygen saturation thresholds 94% and 90% to get 1 and 2 Sequential Organ Failure Assessment respiratory points, respectively. This has important implications primarily for emergency practice, rapid response teams, surveillance, research, and resource-limited settings.


Subjects
Intensive Care Units, Organ Dysfunction Scores, Oxygen Consumption/physiology, Oxygen Saturation/physiology, Sepsis/blood, Sepsis/mortality, Aged, Cohort Studies, Female, Hospital Mortality, Humans, Male, Middle Aged, Oxygen/blood, Retrospective Studies, Systemic Inflammatory Response Syndrome
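
The mapping reported in entry 8, with peripheral oxygen saturation of 94-91% giving 1 respiratory point and less than 91% giving 2 points, can be written as a small lookup. This is an illustrative sketch only and not a clinical tool: the function name is hypothetical, readings are assumed to be whole-number percentages, and values of 95% or above are assumed to score 0.

```python
def spo2_respiratory_points(spo2_percent):
    """Map a peripheral oxygen saturation reading (%) to respiratory points.

    Bands follow the abstract: 94-91% -> 1 point, below 91% -> 2 points.
    Returning 0 for 95% and above is an assumption of this sketch.
    """
    if not 0 <= spo2_percent <= 100:
        raise ValueError("SpO2 must be a percentage between 0 and 100")
    if spo2_percent < 91:
        return 2
    if spo2_percent <= 94:
        return 1
    return 0


for reading in (98, 94, 91, 90):
    print(reading, spo2_respiratory_points(reading))
```

For whole-number readings, "less than 91%" and the 90% threshold named in the abstract's conclusions describe the same cut-off.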
9.
Ups J Med Sci ; 125(4): 316-324, 2020 Nov.
Article in English | MEDLINE | ID: mdl-32696698

ABSTRACT

BACKGROUND: The electronic medical record (EMR) offers unique possibilities for clinical research, but some important patient attributes are not readily available due to its unstructured properties. We applied text mining using machine learning to enable automatic classification of unstructured information on smoking status from Swedish EMR data. METHODS: Data on patients' smoking status from EMRs were used to develop 32 different predictive models that were trained using Weka, changing sentence frequency, classifier type, tokenization, and attribute selection in a database of 85,000 classified sentences. The models were evaluated using F-score and accuracy based on out-of-sample test data including 8500 sentences. The error weight matrix was used to select the best model, assigning a weight to each type of misclassification and applying it to the model confusion matrices. The best performing model was then compared to a rule-based method. RESULTS: The best performing model was based on the Support Vector Machine (SVM) Sequential Minimal Optimization (SMO) classifier using a combination of unigrams and bigrams as tokens. Sentence frequency and attributes selection did not improve model performance. SMO achieved 98.14% accuracy and 0.981 F-score versus 79.32% and 0.756 for the rule-based model. CONCLUSION: A model using machine-learning algorithms to automatically classify patients' smoking status was successfully developed. Such algorithms may enable automatic assessment of smoking status and other unstructured data directly from EMRs without manual classification of complete case notes.


Subjects
Electronic Health Records, Machine Learning, Natural Language Processing, Smoking, Tobacco Use Disorder/diagnosis, Algorithms, Automation, Bayes Theorem, Data Mining, False Positive Reactions, Humans, Medical Informatics, Observer Variation, Automated Pattern Recognition, ROC Curve, Reproducibility of Results, Research Design, Software, Support Vector Machine, Sweden/epidemiology, Tobacco Use Disorder/epidemiology
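
Entry 9 selects its best model by assigning a weight to each type of misclassification and applying the weights to each model's confusion matrix. A minimal sketch of that idea follows. The three-class layout, the weights, the model names, and all counts are invented for illustration and are not taken from the study.

```python
def weighted_error(confusion, weights):
    """Sum of confusion-matrix counts multiplied by their misclassification weights.

    confusion[i][j] is the number of items of true class i predicted as class j;
    weights[i][j] is the cost of that outcome (0 on the diagonal).
    """
    return sum(
        confusion[i][j] * weights[i][j]
        for i in range(len(confusion))
        for j in range(len(confusion[i]))
    )


def select_best(models, weights):
    """Return the name of the model with the lowest weighted error."""
    return min(models, key=lambda name: weighted_error(models[name], weights))


# Hypothetical classes, in order: smoker, ex-smoker, non-smoker.
weights = [
    [0, 1, 3],  # calling a smoker a non-smoker is weighted most heavily
    [1, 0, 1],
    [3, 1, 0],
]
models = {
    "model_a": [[90, 5, 5], [4, 92, 4], [2, 3, 95]],
    "model_b": [[93, 6, 1], [5, 90, 5], [6, 1, 93]],
}
best = select_best(models, weights)
print(best, weighted_error(models[best], weights))
```

With these made-up numbers both models make 23 and 24 errors respectively, and their weighted errors are 37 and 38, so the weighting mainly matters when models differ in which kinds of mistakes they make rather than in how many.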
10.
Stud Health Technol Inform ; 270: 148-152, 2020 Jun 16.
Article in English | MEDLINE | ID: mdl-32570364

ABSTRACT

Sensitive data is normally required to develop rule-based models or to train machine learning-based models for de-identifying electronic health record (EHR) clinical notes, and this presents important problems for patient privacy. In this study, we add non-sensitive public datasets to EHR training data: (i) scientific medical text and (ii) Wikipedia word vectors. The data, all in Swedish, is used to train a deep learning model using recurrent neural networks. Tests on pseudonymized Swedish EHR clinical notes showed that precision and recall improved from 55.62% and 80.02% with the base EHR embedding layer to 85.01% and 87.15% when Wikipedia word vectors were added. These results suggest that non-sensitive text from the general domain can be used to train robust models for de-identifying Swedish clinical text, and this could be useful in cases where the data is both sensitive and in low-resource languages.


Subjects
Electronic Health Records, Language, Machine Learning, Natural Language Processing, Sweden
11.
BMJ Qual Saf ; 29(9): 735-745, 2020 09.
Article in English | MEDLINE | ID: mdl-32029574

ABSTRACT

BACKGROUND: Surveillance of sepsis incidence is important for directing resources and evaluating quality-of-care interventions. The aim was to develop and validate a fully-automated Sepsis-3 based surveillance system in non-intensive care wards using electronic health record (EHR) data, and demonstrate utility by determining the burden of hospital-onset sepsis and variations between wards. METHODS: A rule-based algorithm was developed using EHR data from a cohort of all adult patients admitted at an academic centre between July 2012 and December 2013. Time in intensive care units was censored. To validate algorithm performance, a stratified random sample of 1000 hospital admissions (674 with and 326 without suspected infection) was classified according to the Sepsis-3 clinical criteria (suspected infection defined as having any culture taken and at least two doses of antimicrobials administered, and an increase in Sequential Organ Failure Assessment (SOFA) score by ≥2 points) and the likelihood of infection by physician medical record review. RESULTS: In total 82 653 hospital admissions were included. The Sepsis-3 clinical criteria determined by physician review were met in 343 of 1000 episodes. Among them, 313 (91%) had possible, probable or definite infection. Based on this reference, the algorithm achieved sensitivity 0.887 (95% CI: 0.799 to 0.964), specificity 0.985 (95% CI: 0.978 to 0.991), positive predictive value 0.881 (95% CI: 0.833 to 0.926) and negative predictive value 0.986 (95% CI: 0.973 to 0.996). When applied to the total cohort taking into account the sampling proportions of those with and without suspected infection, the algorithm identified 8599 (10.4%) sepsis episodes. The burden of hospital-onset sepsis (>48 hours after admission) and related in-hospital mortality varied between wards. 
CONCLUSIONS: A fully-automated Sepsis-3 based surveillance algorithm using EHR data performed well compared with physician medical record review in non-intensive care wards, and exposed variations in hospital-onset sepsis incidence between wards.


Subjects
Physicians, Sepsis, Adult, Electronic Health Records, Female, HIV Infections, Hospital Mortality, General Hospitals, Humans, Intensive Care Units, Retrospective Studies
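
Entry 11 reports sensitivity, specificity, positive predictive value and negative predictive value for the surveillance algorithm against physician review. As a reminder of how these four figures relate to a 2x2 table, here is a small sketch. The counts are invented, since the abstract does not give the underlying table, and the sketch ignores the stratified sampling weights that the study took into account.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Compute four standard metrics from 2x2 counts against a reference standard.

    tp/fn: reference-positive episodes flagged / missed by the algorithm
    fp/tn: reference-negative episodes flagged / not flagged by the algorithm
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Hypothetical counts for illustration only.
metrics = diagnostic_metrics(tp=80, fp=10, fn=20, tn=890)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity are conditioned on the reference label, while the two predictive values are conditioned on the algorithm's output, which is why the latter depend on how common sepsis is in the sample.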
12.
Health Informatics J ; 25(4): 1779-1799, 2019 12.
Article in English | MEDLINE | ID: mdl-30232926

ABSTRACT

This article describes the development and evaluation of a set of knowledge patterns that provide guidelines and implications of design for developers of mental health portals. The knowledge patterns were based on three foundations: (1) knowledge integration of language technology approaches; (2) experiments with language technology applications and (3) user studies of portal interaction. A mixed-methods approach was employed for the evaluation of the knowledge patterns: formative workshops with knowledge pattern experts and summative surveys with experts in specific domains. The formative evaluation improved the cohesion of the patterns. The results of the summative evaluation showed that the problems discussed in the patterns were relevant for the domain, and that the knowledge embedded was useful to solve them. Ten patterns out of thirteen achieved an average score above 4.0, which is a positive result that leads us to conclude that they can be used as guidelines for developing health portals.


Subjects
Knowledge, Patient Portals, Program Development/methods, Humans, Program Development/statistics & numerical data, Program Evaluation/methods, Program Evaluation/statistics & numerical data, Surveys and Questionnaires
13.
J Biomed Semantics ; 9(1): 12, 2018 03 30.
Article in English | MEDLINE | ID: mdl-29602312

ABSTRACT

BACKGROUND: Natural language processing applied to clinical text or aimed at a clinical outcome has been thriving in recent years. This paper offers the first broad overview of clinical Natural Language Processing (NLP) for languages other than English. Recent studies are summarized to offer insights and outline opportunities in this area. MAIN BODY: We envision three groups of intended readers: (1) NLP researchers leveraging experience gained in other languages, (2) NLP researchers faced with establishing clinical text processing in a language other than English, and (3) clinical informatics researchers and practitioners looking for resources in their languages in order to apply NLP techniques and tools to clinical practice and/or investigation. We review work in clinical NLP in languages other than English. We classify these studies into three groups: (i) studies describing the development of new NLP systems or components de novo, (ii) studies describing the adaptation of NLP architectures developed for English to another language, and (iii) studies focusing on a particular clinical application. CONCLUSION: We show the advantages and drawbacks of each method, and highlight the appropriate application context. Finally, we identify major challenges and opportunities that will affect the impact of NLP on clinical practice and public health studies in a context that encompasses English as well as other languages.


Subjects
Natural Language Processing, Humans, Semantics
14.
Health Informatics J ; 24(1): 24-42, 2018 03.
Article in English | MEDLINE | ID: mdl-27496862

ABSTRACT

Hospital-acquired infections pose a significant risk to patient health, while their surveillance is an additional workload for hospital staff. Our overall aim is to build a surveillance system that reliably detects all patient records that potentially include hospital-acquired infections. This is to reduce the burden of having the hospital staff manually check patient records. This study focuses on the application of text classification using support vector machines and gradient tree boosting to the problem. Support vector machines and gradient tree boosting have never been applied to the problem of detecting hospital-acquired infections in Swedish patient records, and according to our experiments, they lead to encouraging results. The best result is yielded by gradient tree boosting, at 93.7 percent recall, 79.7 percent precision and 85.7 percent F1 score when using stemming. We can show that simple preprocessing techniques and parameter tuning can lead to high recall (which we aim for in screening patient records) with appropriate precision for this task.


Subjects
Data Analysis, Iatrogenic Disease, Infections/diagnosis, Machine Learning/standards, Support Vector Machine/standards, Electronic Health Records/statistics & numerical data, Humans, Infections/classification, Infections/etiology, Machine Learning/statistics & numerical data, Mass Screening/methods, Mass Screening/standards
15.
J Biomed Inform ; 71: 16-30, 2017 07.
Article in English | MEDLINE | ID: mdl-28526460

ABSTRACT

OBJECTIVE: The goal of this study is to investigate entity recognition within Electronic Health Records (EHRs) focusing on Spanish and Swedish. Of particular importance is a robust representation of the entities. In our case, we utilized unsupervised methods to generate such representations. METHODS: The significance of this work rests on its experimental layout. The experiments were carried out under the same conditions for both languages. Several classification approaches were explored: maximum probability, CRF, Perceptron and SVM. The classifiers were enhanced by means of ensembles of semantic spaces and ensembles of Brown trees. In order to mitigate sparsity of data, without a significant increase in the dimension of the decision space, we propose the use of clustered approaches of the hierarchical Brown clustering represented by trees and vector quantization for each semantic space. RESULTS: The results showed that the semi-supervised approaches significantly improved on standard supervised techniques for both languages. Moreover, clustering the semantic spaces contributed to the quality of the entity recognition while keeping the dimension of the feature-space two orders of magnitude lower than when directly using the semantic spaces. CONCLUSIONS: The contributions of this study are: (a) a set of thorough experiments that enable comparisons regarding the influence of different types of features on different classifiers, exploring two languages other than English; and (b) the use of ensembles of clusters of Brown trees and semantic spaces on EHRs to tackle the problem of scarcity of available annotated data.


Subjects
Electronic Health Records, Machine Learning, Semantics, Cluster Analysis, Data Curation, Humans, Sweden
16.
Stud Health Technol Inform ; 235: 216-220, 2017.
Article in English | MEDLINE | ID: mdl-28423786

ABSTRACT

Obscuring protected health information (PHI) in the clinical text of health records facilitates the secondary use of healthcare data in a privacy-preserving manner. Although automatic de-identification of clinical text using machine learning holds much promise, little is known about the relative prevalence of PHI in different types of clinical text and whether there is a need for domain adaptation when learning predictive models from one particular domain and applying it to another. In this study, we address these questions by training a predictive model and using it to estimate the prevalence of PHI in clinical text written (1) in different clinical specialties, (2) in different types of notes (i.e., under different headings), and (3) by persons in different professional roles. It is demonstrated that the overall PHI density is 1.57%; however, substantial differences exist across domains.


Subjects
Confidentiality, Electronic Health Records, Natural Language Processing, Humans, Machine Learning, Medical Records, Prevalence, Sweden
17.
Stud Health Technol Inform ; 245: 393-397, 2017.
Article in English | MEDLINE | ID: mdl-29295123

ABSTRACT

To enable secondary use of healthcare data in a privacy-preserving manner, there is a need for methods capable of automatically identifying protected health information (PHI) in clinical text. To that end, learning predictive models from labeled examples has emerged as a promising alternative to rule-based systems. However, little is known about differences with respect to PHI prevalence in different types of clinical notes and how potential domain differences may affect the performance of predictive models trained on one particular type of note and applied to another. In this study, we analyze the performance of a predictive model trained on an existing PHI corpus of Swedish clinical notes and applied to a variety of clinical notes: written (i) in different clinical specialties, (ii) under different headings, and (iii) by persons in different professions. The results indicate that domain adaptation is needed for effective detection of PHI in heterogeneous clinical notes.


Subjects
Electronic Health Records, Privacy, Humans, Natural Language Processing, Prevalence, Sweden
18.
BMC Med Inform Decis Mak ; 16 Suppl 2: 69, 2016 07 21.
Article in English | MEDLINE | ID: mdl-27459846

ABSTRACT

BACKGROUND: Learning deep representations of clinical events based on their distributions in electronic health records has been shown to allow for subsequent training of higher-performing predictive models compared to the use of shallow, count-based representations. The predictive performance may be further improved by utilizing multiple representations of the same events, which can be obtained by, for instance, manipulating the representation learning procedure. The question, however, remains how to make best use of a set of diverse representations of clinical events - modeled in an ensemble of semantic spaces - for the purpose of predictive modeling. METHODS: Three different ways of exploiting a set of (ten) distributed representations of four types of clinical events - diagnosis codes, drug codes, measurements, and words in clinical notes - are investigated in a series of experiments using ensembles of randomized trees. Here, the semantic space ensembles are obtained by varying the context window size in the representation learning procedure. The proposed method trains a forest wherein each tree is built from a bootstrap replicate of the training set whose entire original feature set is represented in a randomly selected set of semantic spaces - corresponding to the considered data types - of a given context window size. RESULTS: The proposed method significantly outperforms concatenating the multiple representations of the bagged dataset; it also significantly outperforms representing, for each decision tree, only a subset of the features in a randomly selected set of semantic spaces. A follow-up analysis indicates that the proposed method exhibits less diversity while significantly improving average tree performance. It is also shown that the size of the semantic space ensemble has a significant impact on predictive performance and that performance tends to improve as the size increases. 
CONCLUSIONS: The strategy for utilizing a set of diverse distributed representations of clinical events when constructing ensembles of randomized trees has a significant impact on predictive performance. The most successful strategy - significantly outperforming the considered alternatives - involves randomly sampling distributed representations of the clinical events when building each decision tree in the forest.


Subjects
Decision Trees, Drug-Related Side Effects and Adverse Reactions, Electronic Health Records, Machine Learning, Theoretical Models, Pharmacovigilance, Humans, Semantics
19.
J Biomed Inform ; 57: 333-49, 2015 Oct.
Article in English | MEDLINE | ID: mdl-26291578

ABSTRACT

For the purpose of post-marketing drug safety surveillance, which has traditionally relied on the voluntary reporting of individual cases of adverse drug events (ADEs), other sources of information are now being explored, including electronic health records (EHRs), which give us access to enormous amounts of longitudinal observations of the treatment of patients and their drug use. Adverse drug events, which can be encoded in EHRs with certain diagnosis codes, are, however, heavily underreported. It is therefore important to develop capabilities to process, by means of computational methods, the more unstructured EHR data in the form of clinical notes, where clinicians may describe and reason around suspected ADEs. In this study, we report on the creation of an annotated corpus of Swedish health records for the purpose of learning to identify information pertaining to ADEs present in clinical notes. To this end, three key tasks are tackled: recognizing relevant named entities (disorders, symptoms, drugs), labeling attributes of the recognized entities (negation, speculation, temporality), and relationships between them (indication, adverse drug event). For each of the three tasks, leveraging models of distributional semantics - i.e., unsupervised methods that exploit co-occurrence information to model, typically in vector space, the meaning of words - and, in particular, combinations of such models, is shown to improve the predictive performance. The ability to make use of such unsupervised methods is critical when faced with large amounts of sparse and high-dimensional data, especially in domains where annotated resources are scarce.


Subjects
Drug-Related Side Effects and Adverse Reactions, Electronic Health Records, Semantics, Data Curation, Data Mining, Humans