An Entity Extraction Pipeline for Medical Text Records Using Large Language Models: Analytical Study.

Wang, Lei; Ma, Yinyao; Bi, Wenshuai; Lv, Hanlin; Li, Yuxiang.

J Med Internet Res ; 26: e54580, 2024 Mar 29.

Article in English | MEDLINE | ID: mdl-38551633

ABSTRACT

BACKGROUND: The study of disease progression relies on clinical data, including text data, and extracting valuable features from text data has been a research hot spot. With the rise of large language models (LLMs), semantic-based extraction pipelines are gaining acceptance in clinical research. However, the security and feature hallucination issues of LLMs require further attention. OBJECTIVE: This study aimed to introduce a novel modular LLM pipeline that semantically extracts features from textual patient admission records. METHODS: The pipeline was designed as a systematic succession of concept extraction, aggregation, question generation, corpus extraction, and question-and-answer scale extraction, and was tested with 2 low-parameter LLMs: Qwen-14B-Chat (QWEN) and Baichuan2-13B-Chat (BAICHUAN). A data set of 25,709 pregnancy cases from the People's Hospital of Guangxi Zhuang Autonomous Region, China, was used for evaluation with the help of a local expert's annotation. The pipeline was evaluated using accuracy, precision, null ratio, and time consumption. Additionally, we evaluated its performance with a quantized version of Qwen-14B-Chat on a consumer-grade GPU. RESULTS: The pipeline demonstrated a high level of precision in feature extraction, as evidenced by the accuracy and precision results of Qwen-14B-Chat (95.52% and 92.93%, respectively) and Baichuan2-13B-Chat (95.86% and 90.08%, respectively). Furthermore, the pipeline exhibited low null ratios and variable time consumption. The INT4-quantized version of QWEN delivered an enhanced performance with 97.28% accuracy and a 0% null ratio. CONCLUSIONS: The pipeline exhibited consistent performance across different LLMs and efficiently extracted clinical features from textual data. It also showed reliable performance on consumer-grade hardware. This approach offers a viable and effective solution for mining clinical research data from textual records.
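The abstract names accuracy, precision, and null ratio as evaluation metrics but does not define them. The following is a minimal sketch of one plausible reading for a single binary feature, assuming that a null means the pipeline returned no value for a record and that nulls count as incorrect for accuracy. The function name `evaluate_extraction` and these definitions are illustrative assumptions, not the authors' implementation.

```python
from typing import Optional, Sequence


def evaluate_extraction(
    predicted: Sequence[Optional[bool]], gold: Sequence[bool]
) -> dict:
    """Score extracted values for one binary feature against expert labels.

    Assumed definitions (not taken from the paper):
      accuracy   = records where the extracted value equals the annotation / all records
      precision  = true positives / all records extracted as positive
      null_ratio = records with no extracted value (None) / all records
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("at least one record is required")

    n = len(gold)
    nulls = sum(p is None for p in predicted)
    correct = sum(p is not None and p == g for p, g in zip(predicted, gold))
    predicted_pos = sum(p is True for p in predicted)
    true_pos = sum(p is True and g is True for p, g in zip(predicted, gold))

    return {
        "accuracy": correct / n,
        # Precision is undefined when nothing was extracted as positive.
        "precision": true_pos / predicted_pos if predicted_pos else float("nan"),
        "null_ratio": nulls / n,
    }
```

Under these assumptions a null extraction lowers accuracy and raises the null ratio but does not affect precision, which is computed only over records extracted as positive.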

Subject(s)

Data Mining , Electronic Health Records , Humans , Data Mining/methods , Natural Language Processing , China , Language

An early screening model for preeclampsia: utilizing zero-cost maternal predictors exclusively.

Wang, Lei; Ma, Yinyao; Bi, Wenshuai; Meng, Chenwei; Liang, Xuxia; Wu, Hua; Zhang, Chun; Wang, Xiaogang; Lv, Hanlin; Li, Yuxiang.

Hypertens Res ; 47(4): 1051-1062, 2024 Apr.

Article in English | MEDLINE | ID: mdl-38326453

ABSTRACT

To provide a reliable, low-cost screening model for preeclampsia, this study developed an early screening model in a retrospective cohort (25,709 pregnancies) and validated it in a validation cohort (1760 pregnancies). A data augmentation method (α-inverse weighted-GMM + RUS) was applied to the retrospective cohort, 10 machine learning models were then trained on the augmented data, and the optimal model was chosen by sensitivity at a false positive rate of 10%. The AdaBoost model, utilizing 16 predictors, was chosen as the final model, achieving better-than-acceptable performance with an area under the receiver operating characteristic curve of 0.8008 and a sensitivity of 0.5190. All predictors were derived from clinical characteristics, some of which were previously unreported (such as nausea and vomiting in pregnancy and menstrual cycle irregularity). Compared to previous studies, our model demonstrated superior performance, exhibiting at least a 50% improvement in sensitivity over checklist-based approaches and a minimum 28% increase over multivariable models that solely utilized maternal predictors. We validated an effective approach for early preeclampsia screening incorporating zero-cost predictors, which demonstrates superior performance in comparison to similar studies. We believe that applying this approach in combination with high-performance approaches could substantially increase the screening participation rate among pregnant women. A machine learning model for early preeclampsia screening, using 16 zero-cost predictors derived from clinical characteristics, was built on a 10-year Chinese cohort. The model outperformed similar research by at least 28% and was validated on an independent cohort.
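The model selection criterion above, sensitivity at a false positive rate of 10%, can be sketched as follows. This is an illustrative pure-Python version, assuming binary labels (1 for preeclampsia, 0 otherwise) and that a higher score means higher predicted risk. The function name `sensitivity_at_fpr` is hypothetical and the code is not the authors' implementation.

```python
from typing import Sequence


def sensitivity_at_fpr(
    scores: Sequence[float], labels: Sequence[int], max_fpr: float = 0.10
) -> float:
    """Return the highest sensitivity reachable while keeping FPR <= max_fpr.

    A case is called positive when its score is >= the threshold.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")

    best = 0.0
    # Lowering the threshold can only raise both FPR and sensitivity, so we
    # scan thresholds from high to low and stop once the FPR budget is exceeded.
    for threshold in sorted(set(scores), reverse=True):
        fpr = sum(s >= threshold for s in negatives) / len(negatives)
        if fpr > max_fpr:
            break
        best = sum(s >= threshold for s in positives) / len(positives)
    return best
```

Fixing the false positive rate before comparing sensitivities puts all candidate models on the same operating point, which is why a single number can be used to rank them.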

Subject(s)

Pre-Eclampsia , Pregnancy , Female , Humans , Pre-Eclampsia/diagnosis , Pregnancy Trimester, First , Retrospective Studies , Risk Assessment/methods , Prospective Studies , Biomarkers

Investigating the Impact of Prompt Engineering on the Performance of Large Language Models for Standardizing Obstetric Diagnosis Text: Comparative Study.

Wang, Lei; Bi, Wenshuai; Zhao, Suling; Ma, Yinyao; Lv, Longting; Meng, Chenwei; Fu, Jingru; Lv, Hanlin.

JMIR Form Res ; 8: e53216, 2024 Feb 08.

Article in English | MEDLINE | ID: mdl-38329787

ABSTRACT

BACKGROUND: The accumulation of vast electronic medical records (EMRs) through medical informatization creates significant research value, particularly in obstetrics. Diagnostic standardization across different health care institutions and regions is vital for medical data analysis. Large language models (LLMs) have been extensively used for various medical tasks. Prompt engineering is key to using LLMs effectively. OBJECTIVE: This study aims to evaluate and compare the performance of LLMs with various prompt engineering techniques on the task of standardizing obstetric diagnostic terminology using real-world obstetric data. METHODS: The paper describes a 4-step approach used for mapping diagnoses in electronic medical records to the International Classification of Diseases, 10th revision, observation domain. First, similarity measures were used for mapping the diagnoses. Second, candidate mapping terms were collected based on similarity scores above a threshold, to be used as the training data set. Third, to generate optimal mapping terms, we used 2 LLMs (ChatGLM2 and Qwen-14B-Chat [QWEN]) for zero-shot learning. Finally, in the fourth step, a performance comparison was conducted by using 3 pretrained bidirectional encoder representations from transformers (BERTs), including BERT, whole word masking BERT, and momentum contrastive learning with BERT (MC-BERT), for unsupervised optimal mapping term generation. RESULTS: LLMs and BERT demonstrated comparable performance at their respective optimal levels. LLMs showed clear advantages in terms of performance and efficiency in unsupervised settings. Interestingly, the performance of the LLMs varied significantly across different prompt engineering setups. For instance, when applying the self-consistency approach in QWEN, the F1-score improved by 5%, with precision increasing by 7.9%, outperforming the zero-shot method. Likewise, ChatGLM2 delivered similar rates of accurately generated responses. During the analysis, the BERT series served as a comparative model with comparable results. Among the 3 models, MC-BERT demonstrated the highest level of performance. However, the differences among the versions of BERT in this study were relatively insignificant. CONCLUSIONS: After applying LLMs to standardize diagnoses and designing 4 different prompts, we compared the results to those generated by the BERT model. Our findings indicate that QWEN prompts largely outperformed the other prompts, with precision comparable to that of the BERT model. These results demonstrate the potential of unsupervised approaches in improving the efficiency of aligning diagnostic terms in daily research and uncovering hidden information values in patient data.
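The abstract reports gains from a self-consistency approach but does not describe how it was implemented. In its common form, self-consistency samples several responses to the same prompt and keeps the most frequent answer. The sketch below shows only that voting step, assuming a caller-supplied `sample_fn` that returns one candidate standard term per call. Both `sample_fn` and `self_consistent_answer` are hypothetical names, and no real LLM API is used here.

```python
from collections import Counter
from typing import Callable, Optional


def self_consistent_answer(
    sample_fn: Callable[[str], str], prompt: str, n_samples: int = 5
) -> Optional[str]:
    """Query the model n_samples times and return the majority answer.

    sample_fn is a stand-in for a stochastic LLM call (assumption).
    Empty responses are discarded before voting; None is returned if
    every sample is empty.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")

    answers = [sample_fn(prompt).strip() for _ in range(n_samples)]
    answers = [a for a in answers if a]
    if not answers:
        return None
    # On a tie, Counter.most_common returns the answer seen first.
    return Counter(answers).most_common(1)[0][0]
```

Voting over several samples trades extra inference time for stability: an occasional off-target mapping is outvoted as long as the model returns the intended term more often than any single alternative.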

Identification of downstream targets and signaling pathways of long non-coding RNA NR_002794 in human trophoblast cells.

Ma, Yinyao; Wu, Hua; Liang, Xuxia; Zhang, Chun; Ma, Yanhua; Wei, Yanfen; Li, Jing; Chen, Hui.

Bioengineered ; 12(1): 6617-6628, 2021 Dec.

Article in English | MEDLINE | ID: mdl-34516352

ABSTRACT

Preeclampsia (PE) is a serious threat to pregnant women. Our previous study demonstrated that long non-coding RNA (lncRNA) NR_002794 was highly expressed in placentas of PE patients and could regulate the phenotypes of trophoblast cells. However, the downstream regulatory mechanisms of NR_002794 remain unknown. In this study, potential downstream targets and signaling pathways of NR_002794 were identified through RNA sequencing (RNA-seq) and bioinformatics analysis in SWAN71 trophoblast cells. Western blot assay demonstrated that NR_002794 inactivated protein kinase B (AKT) and extracellular signal-regulated kinase 1/2 (ERK1/2) pathways and activated cell apoptotic signaling in SWAN71 cells. Both RNA-seq and reverse transcription-quantitative PCR (RT-qPCR) results showed that NR_002794 up-regulation notably inhibited the expression of C-C motif chemokine ligand 4 like 2 (CCL4L2), interleukin 15 receptor subunit alpha (IL15RA), interleukin 32 (IL32), and tyrosine kinase with immunoglobulin-like and EGF-like domains 1 (TIE1), while NR_002794 knockdown induced the expression of these genes in SWAN71 cells. CCK-8, BrdU, Transwell, wound healing, and flow cytometry analyses showed that NR_002794 inhibited cell proliferation and migration and induced cell apoptosis by down-regulating TIE1 in SWAN71 cells. In conclusion, lncRNA NR_002794 could exert its functions by regulating AKT and ERK1/2 pathways and TIE1 expression in human trophoblast cells.

Subject(s)

MAP Kinase Signaling System/genetics , RNA, Long Noncoding/genetics , Trophoblasts/metabolism , Cell Line , Extracellular Signal-Regulated MAP Kinases/genetics , Extracellular Signal-Regulated MAP Kinases/metabolism , Female , Humans , Pre-Eclampsia , Pregnancy , Proto-Oncogene Proteins c-akt/genetics , Proto-Oncogene Proteins c-akt/metabolism , Receptor, TIE-1/genetics , Receptor, TIE-1/metabolism

Long noncoding RNA NR_002794 is upregulated in preeclampsia and regulates the proliferation, apoptosis and invasion of trophoblast cells.

Ma, Yinyao; Liang, Xuxia; Wu, Hua; Zhang, Chun; Ma, Yanhua.

Mol Med Rep ; 20(5): 4567-4575, 2019 Nov.

Article in English | MEDLINE | ID: mdl-31702023

ABSTRACT

Preeclampsia is a common complication during pregnancy, characterized by hypertension and proteinuria. The pathogenesis of preeclampsia is not fully understood. Studies on the maternal spiral artery have led scientists to consider that the ineffective infiltration of placental trophoblast cells may be a primary cause of preeclampsia. The present study aimed to investigate the differences in the profiles of long noncoding RNAs (lncRNAs) between the placentas of patients with preeclampsia and those of healthy pregnant women. The involvement of the differentially expressed lncRNAs in the biological activity of trophoblast cells was also assessed. A total of 26 differentially expressed lncRNAs were identified between the preeclampsia and healthy groups. Upregulation of NR_002794 was found in tissues from patients with preeclampsia. In SWAN71 trophoblast cells, NR_002794 had suppressive effects on proliferation and migration, and resulted in an increased rate of apoptosis. Furthermore, lncRNA NR_002794 had no effect on the phagocytosis of trophoblast cells. The present study suggested that abnormal levels of NR_002794 may lead to atypical conditions in trophoblast cells, which may be associated with the failure of maternal spiral artery remodeling during pregnancy and, consequently, with the development of preeclampsia.

Subject(s)

Apoptosis , Cell Proliferation , Pre-Eclampsia/metabolism , RNA, Long Noncoding/biosynthesis , Trophoblasts/metabolism , Up-Regulation , Cell Line , Female , Humans , Phagocytosis , Pre-Eclampsia/genetics , Pre-Eclampsia/pathology , Pregnancy , RNA, Long Noncoding/genetics , Trophoblasts/pathology
