Results 1 - 4 of 4
1.
JMIR Med Inform ; 10(11): e41342, 2022 Nov 10.
Article in English | MEDLINE | ID: mdl-36355417

ABSTRACT

BACKGROUND: The automatic coding of clinical text documents by using the International Classification of Diseases, 10th Revision (ICD-10) can be performed for statistical analyses and reimbursements. With the development of natural language processing models, new transformer architectures with attention mechanisms have outperformed previous models. Although multicenter training may increase a model's performance and external validity, the privacy of clinical documents should be protected. We used federated learning to train a model with multicenter data, without sharing data per se. OBJECTIVE: This study aims to train a classification model via federated learning for ICD-10 multilabel classification. METHODS: Text data from discharge notes in electronic medical records were collected from the following three medical centers: Far Eastern Memorial Hospital, National Taiwan University Hospital, and Taipei Veterans General Hospital. After comparing the performance of different variants of bidirectional encoder representations from transformers (BERT), PubMedBERT was chosen for the word embeddings. With regard to preprocessing, nonalphanumeric characters were retained because the model's performance decreased after their removal. To explain the outputs of our model, we added a label attention mechanism to the model architecture. The model was trained with data from each of the three hospitals separately and via federated learning. The models trained via federated learning and the models trained with local data were compared on a testing set composed of data from the three hospitals. The micro F1 score was used to evaluate model performance across all three centers. RESULTS: The F1 scores of PubMedBERT, RoBERTa (Robustly Optimized BERT Pretraining Approach), ClinicalBERT, and BioBERT (BERT for Biomedical Text Mining) were 0.735, 0.692, 0.711, and 0.721, respectively. The F1 score of the model that retained nonalphanumeric characters was 0.8120, whereas the F1 score after removing these characters was 0.7875, a decrease of 0.0245 (3.11%). The F1 scores on the testing set were 0.6142, 0.4472, 0.5353, and 0.2522 for the federated learning, Far Eastern Memorial Hospital, National Taiwan University Hospital, and Taipei Veterans General Hospital models, respectively. The explainable predictions were displayed with highlighted input words via the label attention architecture. CONCLUSIONS: Federated learning was used to train the ICD-10 classification model on multicenter clinical text while protecting data privacy. The model's performance was better than that of models trained locally.
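The federated setup described above can be illustrated with a minimal federated-averaging (FedAvg) sketch: each hospital trains on its own private data, and only model weights, never the clinical text, are shared and averaged. The toy classifier, layer sizes, and synthetic tensors below are illustrative assumptions; the study's actual model is PubMedBERT with label attention, but its weights would be averaged the same way.

```python
# Minimal FedAvg sketch: average locally trained weights across sites.
# The MultiLabelClassifier below is a hypothetical stand-in, not the
# paper's PubMedBERT + label-attention model.
import copy
import torch
import torch.nn as nn

class MultiLabelClassifier(nn.Module):
    """Toy stand-in for the real encoder; 768 mimics a BERT embedding."""
    def __init__(self, n_features: int = 768, n_labels: int = 50):
        super().__init__()
        self.fc = nn.Linear(n_features, n_labels)

    def forward(self, x):
        return self.fc(x)  # logits; sigmoid is applied per label for multilabel

def local_train(model, features, labels, epochs=1, lr=1e-3):
    """One round of local training on a single site's private data."""
    model = copy.deepcopy(model)  # leave the global model untouched
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()  # standard multilabel objective
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(features), labels)
        loss.backward()
        opt.step()
    return model.state_dict(), len(labels)

def fed_avg(states, sizes):
    """Average state dicts, weighted by each site's dataset size."""
    total = sum(sizes)
    avg = copy.deepcopy(states[0])
    for key in avg:
        avg[key] = sum(s[key] * (n / total) for s, n in zip(states, sizes))
    return avg

# One communication round with three simulated hospitals.
global_model = MultiLabelClassifier()
sites = [(torch.randn(32, 768), torch.randint(0, 2, (32, 50)).float())
         for _ in range(3)]
states, sizes = zip(*(local_train(global_model, x, y) for x, y in sites))
global_model.load_state_dict(fed_avg(list(states), list(sizes)))
```

In a real deployment, each round's local training would run inside each hospital's network, with only the state dicts crossing institutional boundaries.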

2.
JMIR Med Inform ; 10(6): e37557, 2022 Jun 29.
Article in English | MEDLINE | ID: mdl-35767353

ABSTRACT

BACKGROUND: The tenth revision of the International Classification of Diseases (ICD-10) is widely used for epidemiological research and health management. The clinical modification (CM) and procedure coding system (PCS) of ICD-10 were developed to describe more clinical details with an increasing number of diagnosis and procedure codes and are applied in diagnosis-related groups for reimbursement. The expansion of codes made coding time-consuming and less accurate. The state-of-the-art model using deep contextual word embeddings was used for automatic multilabel text classification of ICD-10. In addition to the input discharge diagnoses (DD), performance can be improved by appropriate preprocessing methods for the text from other document types, such as medical history, comorbidity and complication, surgical method, and special examination. OBJECTIVE: This study aims to establish a contextual language model with rule-based preprocessing methods to develop the model for ICD-10 multilabel classification. METHODS: We retrieved electronic health records from a medical center. We first compared different word embedding methods. Second, we compared the preprocessing methods using the best-performing embeddings. We compared biomedical bidirectional encoder representations from transformers (BioBERT), clinical generalized autoregressive pretraining for language understanding (Clinical XLNet), the label tree-based attention-aware deep model for high-performance extreme multilabel text classification (AttentionXML), and word-to-vector (Word2Vec) to predict ICD-10-CM. To compare different preprocessing methods for ICD-10-CM, we included DD, medical history, and comorbidity and complication as inputs. We compared the performance of ICD-10-CM prediction using different preprocessing steps, including definition training, external cause code removal, number conversion, and combination code filtering. For the ICD-10-PCS, the model was trained using different combinations of DD, surgical method, and keywords of special examination. The micro F1 score and the micro area under the receiver operating characteristic curve were used to compare the model's performance with that of different preprocessing methods. RESULTS: BioBERT had an F1 score of 0.701 and outperformed other models such as Clinical XLNet, AttentionXML, and Word2Vec. For ICD-10-CM, the model's F1 score significantly increased from 0.749 (95% CI 0.744-0.753) to 0.769 (95% CI 0.764-0.773) with ICD-10 definition training, external cause code removal, number conversion, and combination code filtering. For ICD-10-PCS, the model's F1 score significantly increased from 0.670 (95% CI 0.663-0.678) to 0.726 (95% CI 0.719-0.732) with a combination of discharge diagnoses, surgical methods, and keywords of special examination. With our preprocessing methods, the model had the highest area under the receiver operating characteristic curve of 0.853 (95% CI 0.849-0.855) and 0.831 (95% CI 0.827-0.834) for ICD-10-CM and ICD-10-PCS, respectively. CONCLUSIONS: The performance of our model with the pretrained contextualized language model and rule-based preprocessing methods is better than that of the state-of-the-art model for ICD-10-CM or ICD-10-PCS. This study highlights the importance of rule-based preprocessing methods grounded in human coders' rules.
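To make the rule-based preprocessing concrete, here is a hedged sketch of two of the named steps: external cause code removal and number conversion. The abstract does not give the exact rules, so the regular expressions below are illustrative assumptions rather than the paper's implementation.

```python
# Hypothetical sketches of two preprocessing steps named in the abstract.
import re

# ICD-10 external cause codes fall in the V00-Y99 range.
EXTERNAL_CAUSE = re.compile(r"^[V-Y]\d{2}")

def remove_external_cause_codes(codes):
    """Drop external cause codes from the label set before training."""
    return [c for c in codes if not EXTERNAL_CAUSE.match(c)]

def convert_numbers(text):
    """Replace literal numbers with one placeholder token so the model
    does not treat every lab value or date as a distinct word."""
    return re.sub(r"\d+(\.\d+)?", "[NUM]", text)

print(remove_external_cause_codes(["I21.0", "W19", "E11.9"]))
# -> ['I21.0', 'E11.9']
print(convert_numbers("Glucose 253 mg/dL on admission"))
# -> 'Glucose [NUM] mg/dL on admission'
```

The other two steps, definition training (augmenting the training set with official code definitions) and combination code filtering, would follow the same pattern: deterministic transformations applied to text and labels before the contextual model ever sees them.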

3.
BMJ Open ; 11(12): e042802, 2021 12 13.
Article in English | MEDLINE | ID: mdl-34903529

ABSTRACT

OBJECTIVES: To determine whether occupation type, distinguished by socioeconomic status (SES) and sedentary status, is associated with metabolic syndrome (MetS) risk. METHODS: We analysed two data sets covering 73 506 individuals. MetS was identified according to the criteria of the modified Adult Treatment Panel III. Eight occupational categories were considered: professionals, technical workers, managers, salespeople, service staff, administrative staff, manual labourers and taxi drivers; occupations were grouped into non-sedentary; sedentary, high-SES; and sedentary, non-high-SES occupations. Multiple logistic regression was used to determine significant risk factors for MetS in three age-stratified subgroups. R software for Windows (V.3.5.1) was used for all statistical analyses. RESULTS: MetS prevalence increased with age. Among participants aged ≤40 years, where MetS prevalence was low at 6.23%, having a non-sedentary occupation reduced MetS risk (OR=0.88, p=0.0295). Among participants aged >60 years, having a sedentary, high-SES occupation significantly increased MetS risk (OR=1.39, p=0.0247). CONCLUSIONS: The influence of occupation type on MetS risk differs among age groups. Non-sedentary occupations decrease MetS risk among younger adults, whereas sedentary, high-SES occupations increase it among older adults. Authorities should focus on individuals in sedentary, high-SES occupations.
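The analysis described above (the study itself used R 3.5.1) can be sketched in outline: fit a multiple logistic regression for MetS within each age stratum and read odds ratios off the exponentiated coefficients. The variable names, grouping cut points, and synthetic data below are assumptions for illustration only.

```python
# Hedged sketch of an age-stratified logistic regression for MetS risk.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "mets": rng.integers(0, 2, n),  # MetS present (0/1), synthetic
    "occupation": rng.choice(["non_sedentary",
                              "sedentary_high_ses",
                              "sedentary_non_high_ses"], n),
    "age": rng.integers(20, 80, n),
})

# Stratify by age group, then fit a logistic model within each stratum.
age_group = pd.cut(df["age"], [0, 40, 60, 120], labels=["<=40", "41-60", ">60"])
for label, sub in df.groupby(age_group):
    model = smf.logit("mets ~ C(occupation)", data=sub).fit(disp=0)
    print(label, np.exp(model.params))  # exponentiated coefficients = ORs
```

With real data, the model would also include the usual covariates (sex, smoking, and so on); the stratified loop is the part that mirrors the study's three age subgroups.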


Subjects
Metabolic Syndrome, Adult, Aged, Humans, Metabolic Syndrome/epidemiology, Middle Aged, Occupations, Prevalence, Risk Assessment, Risk Factors, Social Class
4.
JMIR Med Inform ; 9(8): e23230, 2021 Aug 31.
Article in English | MEDLINE | ID: mdl-34463639

ABSTRACT

BACKGROUND: The International Classification of Diseases (ICD) code is widely used as the reference in medical systems and for billing purposes. However, classifying diseases into ICD codes still mainly relies on humans reading a large amount of written material, which makes coding both laborious and time-consuming. Since the conversion of ICD-9 to ICD-10, the coding task has become much more complicated, and deep learning- and natural language processing-related approaches have been studied to assist disease coders. OBJECTIVE: This paper aims to construct a deep learning model for ICD-10 coding that automatically determines the corresponding diagnosis and procedure codes based solely on free-text medical notes, to improve accuracy and reduce human effort. METHODS: We used diagnosis records of the National Taiwan University Hospital as resources and applied natural language processing techniques, including global vectors, word to vectors, embeddings from language models, bidirectional encoder representations from transformers, and a single-head attention recurrent neural network, on the deep neural network architecture to implement ICD-10 auto-coding. In addition, we introduced an attention mechanism into the classification model to extract the keywords from diagnoses and visualize the coding reference for training new ICD-10 coders. Sixty discharge notes were randomly selected to examine the change in the F1-score and the coding time by coders before and after using our model. RESULTS: In experiments on the medical data set of National Taiwan University Hospital, our prediction results revealed F1-scores of 0.715 and 0.618 for the ICD-10 Clinical Modification code and Procedure Coding System code, respectively, with a bidirectional encoder representations from transformers embedding approach in the gated recurrent unit classification model. The well-trained models were deployed on an ICD-10 web service for coding and for training ICD-10 users. With this service, coders' F1-scores significantly increased from a median of 0.832 to 0.922 (P<.05), but their coding time was not reduced. CONCLUSIONS: The proposed model significantly improved the F1-score but did not decrease the time consumed in coding by disease coders.
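As a rough sketch of the classification stage described above, the following shows a GRU classifier with an additive attention layer that returns per-token weights, which is the mechanism that makes the keyword highlighting mentioned in the abstract possible. The layer sizes and single attention head are assumptions, not the paper's exact architecture.

```python
# Minimal GRU + attention sketch that yields token weights for highlighting.
import torch
import torch.nn as nn

class AttnGRUClassifier(nn.Module):
    def __init__(self, emb_dim=768, hidden=256, n_labels=50):
        super().__init__()
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True,
                          bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)   # scores each token
        self.out = nn.Linear(2 * hidden, n_labels)

    def forward(self, emb):                    # emb: (batch, seq, emb_dim)
        h, _ = self.gru(emb)                   # (batch, seq, 2*hidden)
        # Softmax over the sequence gives one weight per token.
        weights = torch.softmax(self.attn(h).squeeze(-1), dim=1)
        # Weighted sum of hidden states -> one context vector per note.
        context = torch.einsum("bs,bsh->bh", weights, h)
        return self.out(context), weights      # logits + token weights

# In practice the embeddings would come from a frozen BERT encoder;
# random tensors stand in here to keep the sketch self-contained.
model = AttnGRUClassifier()
logits, weights = model(torch.randn(2, 128, 768))
top_tokens = weights.argmax(dim=1)  # highest-attention token per note
```

The returned weights are what a web service would map back onto the input words to highlight the phrases that drove each predicted code, which is how the abstract's coding-reference visualization for trainees would be built.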
