Pesquisa | Portal Regional da BVS (teste)

Unlocking the Secrets Behind Advanced Artificial Intelligence Language Models in Deidentifying Chinese-English Mixed Clinical Text: Development and Validation Study.

Lee, You-Qian; Chen, Ching-Tai; Chen, Chien-Chang; Lee, Chung-Hong; Chen, Peitsz; Wu, Chi-Shin; Dai, Hong-Jie.

J Med Internet Res ; 26: e48443, 2024 Jan 25.

Artigo em Inglês | MEDLINE | ID: mdl-38271060

RESUMO

BACKGROUND: The widespread use of electronic health records in the clinical and biomedical fields makes the removal of protected health information (PHI) essential to maintain privacy. However, a significant portion of information is recorded in unstructured textual forms, posing a challenge for deidentification. In multilingual countries, medical records could be written in a mixture of more than one language, referred to as code mixing. Most current clinical natural language processing techniques are designed for monolingual text, and there is a need to address the deidentification of code-mixed text. OBJECTIVE: The aim of this study was to investigate the effectiveness and underlying mechanism of fine-tuned pretrained language models (PLMs) in identifying PHI in the code-mixed context. Additionally, we aimed to evaluate the potential of prompting large language models (LLMs) for recognizing PHI in a zero-shot manner. METHODS: We compiled the first clinical code-mixed deidentification data set consisting of text written in Chinese and English. We explored the effectiveness of fine-tuned PLMs for recognizing PHI in code-mixed content, with a focus on whether PLMs exploit naming regularity and mention coverage to achieve superior performance, by probing the developed models' outputs to examine their decision-making process. Furthermore, we investigated the potential of prompt-based in-context learning of LLMs for recognizing PHI in code-mixed text. RESULTS: The developed methods were evaluated on a code-mixed deidentification corpus of 1700 discharge summaries. We observed that different PHI types had preferences in their occurrences within the different types of language-mixed sentences, and PLMs could effectively recognize PHI by exploiting the learned name regularity. However, the models may exhibit suboptimal results when regularity is weak or mentions contain unknown words that the representations cannot generate well. We also found that the availability of code-mixed training instances is essential for the model's performance. Furthermore, the LLM-based deidentification method was a feasible and appealing approach that can be controlled and enhanced through natural language prompts. CONCLUSIONS: The study contributes to understanding the underlying mechanism of PLMs in addressing the deidentification process in the code-mixed context and highlights the significance of incorporating code-mixed training instances into the model training phase. To support the advancement of research, we created a manipulated subset of the resynthesized data set available for research purposes. Based on the compiled data set, we found that the LLM-based deidentification method is a feasible approach, but carefully crafted prompts are essential to avoid unwanted output. However, the use of such methods in the hospital setting requires careful consideration of data security and privacy concerns. Further research could explore the augmentation of PLMs and LLMs with external knowledge to improve their strength in recognizing rare PHI.

Assuntos

Inteligência Artificial , Registros Eletrônicos de Saúde , Humanos , Processamento de Linguagem Natural , Privacidade , China

Protected Health Information Recognition of Unstructured Code-Mixed Electronic Health Records in Taiwan.

Lee, You-Qian; Wang, Bo-Hong; Su, Chu-Hsien; Chen, Pei-Tsz; Lin, Wu-Qing; Wu, Chi-Shin; Dai, Hong-Jie.

Stud Health Technol Inform ; 290: 627-631, 2022 Jun 06.

Artigo em Inglês | MEDLINE | ID: mdl-35673092

RESUMO

Electronic health records (EHRs) at medical institutions provide valuable sources for research in both clinical and biomedical domains. However, before such records can be used for research purposes, protected health information (PHI) mentioned in the unstructured text must be removed. In Taiwan's EHR systems the unstructured EHR texts are usually represented in the mixing of English and Chinese languages, which brings challenges for de-identification. This paper presented the first study, to the best of our knowledge, of the construction of a code-mixed EHR de-identification corpus and the evaluation of different mature entity recognition methods applied for the code-mixed PHI recognition task.

Assuntos

Confidencialidade , Registros Eletrônicos de Saúde , Idioma , Processamento de Linguagem Natural , Taiwan

Family History Information Extraction With Neural Attention and an Enhanced Relation-Side Scheme: Algorithm Development and Validation.

Dai, Hong-Jie; Lee, You-Qian; Nekkantti, Chandini; Jonnagaddala, Jitendra.

JMIR Med Inform ; 8(12): e21750, 2020 Dec 01.

Artigo em Inglês | MEDLINE | ID: mdl-33258777

RESUMO

BACKGROUND: Identifying and extracting family history information (FHI) from clinical reports are significant for recognizing disease susceptibility. However, FHI is usually described in a narrative manner within patients' electronic health records, which requires the application of natural language processing technologies to automatically extract such information to provide more comprehensive patient-centered information to physicians. OBJECTIVE: This study aimed to overcome the 2 main challenges observed in previous research focusing on FHI extraction. One is the requirement to develop postprocessing rules to infer the member and side information of family mentions. The other is to efficiently utilize intrasentence and intersentence information to assist FHI extraction. METHODS: We formulated the task as a sequential labeling problem and propose an enhanced relation-side scheme that encodes the required family member properties to not only eliminate the need for postprocessing rules but also relieve the insufficient training instance issues. Moreover, an attention-based neural network structure was proposed to exploit cross-sentence information to identify FHI and its attributes requiring cross-sentence inference. RESULTS: The dataset released by the 2019 n2c2/OHNLP family history extraction task was used to evaluate the performance of the proposed methods. We started by comparing the performance of the traditional neural sequence models with the ordinary scheme and enhanced scheme. Next, we studied the effectiveness of the proposed attention-enhanced neural networks by comparing their performance with that of the traditional networks. It was observed that, with the enhanced scheme, the recall of the neural network can be improved, leading to an increase in the F score of 0.024. The proposed neural attention mechanism enhanced both the recall and precision and resulted in an improved F score of 0.807, which was ranked fourth in the shared task. CONCLUSIONS: We presented an attention-based neural network along with an enhanced tag scheme that enables the neural network model to learn and interpret the implicit relationship and side information of the recognized family members across sentences without relying on heuristic rules.

Deep Learning-Based Natural Language Processing for Screening Psychiatric Patients.

Dai, Hong-Jie; Su, Chu-Hsien; Lee, You-Qian; Zhang, You-Chen; Wang, Chen-Kai; Kuo, Chian-Jue; Wu, Chi-Shin.

Front Psychiatry ; 11: 533949, 2020.

Artigo em Inglês | MEDLINE | ID: mdl-33584354

RESUMO

The introduction of pre-trained language models in natural language processing (NLP) based on deep learning and the availability of electronic health records (EHRs) presents a great opportunity to transfer the "knowledge" learned from data in the general domain to enable the analysis of unstructured textual data in clinical domains. This study explored the feasibility of applying NLP to a small EHR dataset to investigate the power of transfer learning to facilitate the process of patient screening in psychiatry. A total of 500 patients were randomly selected from a medical center database. Three annotators with clinical experience reviewed the notes to make diagnoses for major/minor depression, bipolar disorder, schizophrenia, and dementia to form a small and highly imbalanced corpus. Several state-of-the-art NLP methods based on deep learning along with pre-trained models based on shallow or deep transfer learning were adapted to develop models to classify the aforementioned diseases. We hypothesized that the models that rely on transferred knowledge would be expected to outperform the models learned from scratch. The experimental results demonstrated that the models with the pre-trained techniques outperformed the models without transferred knowledge by micro-avg. and macro-avg. F-scores of 0.11 and 0.28, respectively. Our results also suggested that the use of the feature dependency strategy to build multi-labeling models instead of problem transformation is superior considering its higher performance and simplicity in the training process.

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA