Search | VHL Regional Portal

1.

Admission characteristics of patients with short term hospitalization.

Frenkel Nir, Yael; Levy, Yuval; Grossman, Ehud; Klang, Eyal.

Isr J Health Policy Res ; 13(1): 51, 2024 Sep 26.

Article in English | MEDLINE | ID: mdl-39327571

ABSTRACT

BACKGROUND: Sheba Medical Center (SMC) is the largest hospital in Israel and has been coping with a steady increase in total Emergency Department (ED) visits. Over 140,000 patients arrive at the SMC's ED every year. Of those, 19% are admitted to the medical wards. Some are very short hospitalizations (one night or less). This puts a heavy burden on the medical wards. We aimed to identify the characteristics of short hospitalizations. METHODS: We retrospectively retrieved data of consecutive adult patients admitted to our hospital during January 1, 2013, to December 31, 2019. We limited the cohort to patients who were admitted to the medical wards. We divided the study group into those with short, those with non-short hospitalization and those who were discharged from the ED. RESULTS: Out of 133,126 admissions, 59,994 (45.0%) were hospitalized for short term. Patients in the short hospitalization group were younger and had fewer comorbidities. The highest rate of short hospitalization was recorded during night shifts (58.4%) and the rate of short hospitalization was associated with the ED daily patient load (r = 0.35, p < 0.001). The likelihood of having a short hospitalization was most prominent in patients with suicide attempt (80.0% of those admitted for this complaint had a short hospitalization), followed by hypertension (68.6%). However, these complaints accounted for only 0.7% of the total number of short hospitalizations. Cardiac and neurological complaints however, made up 27.4% of the short hospitalizations. The 30-days mortality rate was 7.0% in the non-short hospitalization group, 4.3% in the short hospitalization group and 0.9% in those who were discharged from the ED. CONCLUSIONS: Short hospitalizations in medical wards have special characteristics that may render them predictable. Increasing the rate of treating personnel per patient during peak hours and referring subsets of patients with cardiac and neurological complaints to ED-associated short term observation units may decrease short admissions to medical departments.

Subject(s)

Emergency Service, Hospital , Hospitalization , Patient Admission , Humans , Israel/epidemiology , Male , Female , Retrospective Studies , Middle Aged , Aged , Adult , Hospitalization/statistics & numerical data , Emergency Service, Hospital/statistics & numerical data , Patient Admission/statistics & numerical data , Patient Admission/trends , Length of Stay/statistics & numerical data

2.

Exploring the role of Large Language Models in haematology: A focused review of applications, benefits and limitations.

Mudrik, Aya; Nadkarni, Girish N; Efros, Orly; Glicksberg, Benjamin S; Klang, Eyal; Soffer, Shelly.

Br J Haematol ; 2024 Sep 03.

Article in English | MEDLINE | ID: mdl-39226157

ABSTRACT

Large language models (LLMs) have significantly impacted various fields with their ability to understand and generate human-like text. This study explores the potential benefits and limitations of integrating LLMs, such as ChatGPT, into haematology practices. Utilizing systematic review methodologies, we analysed studies published after 1 December 2022, from databases like PubMed, Web of Science and Scopus, and assessing each for bias with the QUADAS-2 tool. We reviewed 10 studies that applied LLMs in various haematology contexts. These models demonstrated proficiency in specific tasks, such as achieving 76% diagnostic accuracy for haemoglobinopathies. However, the research highlighted inconsistencies in performance and reference accuracy, indicating variability in reliability across different uses. Additionally, the limited scope of these studies and constraints on datasets could potentially limit the generalizability of our findings. The findings suggest that, while LLMs provide notable advantages in enhancing diagnostic processes and educational resources within haematology, their integration into clinical practice requires careful consideration. Before implementing them in haematology, rigorous testing and specific adaptation are essential. This involves validating their accuracy and reliability across different scenarios. Given the field's complexity, it is also critical to continuously monitor these models and adapt them responsively.

3.

The role of FDG PET/CT radiomics in the prediction of pathological response to neoadjuvant treatment in patients with esophageal cancer.

Eifer, Michal; Peters-Founshtein, Gregory; Yoel, Lotem Cohn; Pinian, Hodaya; Steiner, Roee; Klang, Eyal; Catalano, Onofrio A; Eshet, Yael; Domachevsky, Liran.

Rep Pract Oncol Radiother ; 29(2): 211-218, 2024.

Article in English | MEDLINE | ID: mdl-39143975

ABSTRACT

Background: Attainment of a complete histopathological response following neoadjuvant therapy has been associated with favorable long-term survival outcomes in esophageal cancer patients. We investigated the ability of 18F-fluorodeoxyglucose positron emission tomography/computed tomography (FDG PET/CT) radiomic features to predict the pathological response to neoadjuvant treatment in patients with esophageal cancer. Materials and methods: A retrospective review of medical records of patients with locally advanced resectable esophageal or esophagogastric junctional cancers. Included patients had a baseline FDG PET/CT scan and underwent Chemoradiotherapy for Oesophageal Cancer Followed by Surgery Study (CROSS) protocol followed by surgery. Four demographic variables and 107 PET radiomic features were extracted and analyzed using univariate and multivariate analyses to predict response to neoadjuvant therapy. Results: Overall, 53 FDG-avid primary esophageal cancer lesions were segmented and radiomic features were extracted. Seventeen radiomic features and 2 non-radiomics variables were found to exhibit significant differences between neoadjuvant therapy responders and non-responders. An unsupervised hierarchical clustering analysis using these 19 variables classified patients in a manner significantly associated with response to neoadjuvant treatment (p < 0.01). Conclusion: Our findings highlight the potential of FDG PET/CT radiomic features as a predictor for the response to neoadjuvant therapy in esophageal cancer patients. The combination of these radiomic features with select non-radiomic variables provides a model for stratifying patients based on their likelihood to respond to neoadjuvant treatment.

4.

Assessing GPT-4 multimodal performance in radiological image analysis.

Brin, Dana; Sorin, Vera; Barash, Yiftach; Konen, Eli; Glicksberg, Benjamin S; Nadkarni, Girish N; Klang, Eyal.

Eur Radiol ; 2024 Aug 30.

Article in English | MEDLINE | ID: mdl-39214893

ABSTRACT

OBJECTIVES: This study aims to assess the performance of a multimodal artificial intelligence (AI) model capable of analyzing both images and textual data (GPT-4V), in interpreting radiological images. It focuses on a range of modalities, anatomical regions, and pathologies to explore the potential of zero-shot generative AI in enhancing diagnostic processes in radiology. METHODS: We analyzed 230 anonymized emergency room diagnostic images, consecutively collected over 1 week, using GPT-4V. Modalities included ultrasound (US), computerized tomography (CT), and X-ray images. The interpretations provided by GPT-4V were then compared with those of senior radiologists. This comparison aimed to evaluate the accuracy of GPT-4V in recognizing the imaging modality, anatomical region, and pathology present in the images. RESULTS: GPT-4V identified the imaging modality correctly in 100% of cases (221/221), the anatomical region in 87.1% (189/217), and the pathology in 35.2% (76/216). However, the model's performance varied significantly across different modalities, with anatomical region identification accuracy ranging from 60.9% (39/64) in US images to 97% (98/101) and 100% (52/52) in CT and X-ray images (p < 0.001). Similarly, pathology identification ranged from 9.1% (6/66) in US images to 36.4% (36/99) in CT and 66.7% (34/51) in X-ray images (p < 0.001). These variations indicate inconsistencies in GPT-4V's ability to interpret radiological images accurately. CONCLUSION: While the integration of AI in radiology, exemplified by multimodal GPT-4, offers promising avenues for diagnostic enhancement, the current capabilities of GPT-4V are not yet reliable for interpreting radiological images. This study underscores the necessity for ongoing development to achieve dependable performance in radiology diagnostics. CLINICAL RELEVANCE STATEMENT: Although GPT-4V shows promise in radiological image interpretation, its high diagnostic hallucination rate (> 40%) indicates it cannot be trusted for clinical use as a standalone tool. Improvements are necessary to enhance its reliability and ensure patient safety. KEY POINTS: GPT-4V's capability in analyzing images offers new clinical possibilities in radiology. GPT-4V excels in identifying imaging modalities but demonstrates inconsistent anatomy and pathology detection. Ongoing AI advancements are necessary to enhance diagnostic reliability in radiological applications.

5.

Advancing radiology practice and research: harnessing the potential of large language models amidst imperfections.

Klang, Eyal; Alper, Lee; Sorin, Vera; Barash, Yiftach; Nadkarni, Girish N; Zimlichman, Eyal.

BJR Open ; 6(1): tzae022, 2024 Jan.

Article in English | MEDLINE | ID: mdl-39193585

ABSTRACT

Large language models (LLMs) are transforming the field of natural language processing (NLP). These models offer opportunities for radiologists to make a meaningful impact in their field. NLP is a part of artificial intelligence (AI) that uses computer algorithms to study and understand text data. Recent advances in NLP include the Attention mechanism and the Transformer architecture. Transformer-based LLMs, such as GPT-4 and Gemini, are trained on massive amounts of data and generate human-like text. They are ideal for analysing large text data in academic research and clinical practice in radiology. Despite their promise, LLMs have limitations, including their dependency on the diversity and quality of their training data and the potential for false outputs. Albeit these limitations, the use of LLMs in radiology holds promise and is gaining momentum. By embracing the potential of LLMs, radiologists can gain valuable insights and improve the efficiency of their work. This can ultimately lead to improved patient care.

6.

Traditional Machine Learning, Deep Learning, and BERT (Large Language Model) Approaches for Predicting Hospitalizations From Nurse Triage Notes: Comparative Evaluation of Resource Management.

Patel, Dhavalkumar; Timsina, Prem; Gorenstein, Larisa; Glicksberg, Benjamin S; Raut, Ganesh; Cheetirala, Satya Narayan; Santana, Fabio; Tamegue, Jules; Kia, Arash; Zimlichman, Eyal; Levin, Matthew A; Freeman, Robert; Klang, Eyal.

JMIR AI ; 3: e52190, 2024 Aug 27.

Article in English | MEDLINE | ID: mdl-39190905

ABSTRACT

BACKGROUND: Predicting hospitalization from nurse triage notes has the potential to augment care. However, there needs to be careful considerations for which models to choose for this goal. Specifically, health systems will have varying degrees of computational infrastructure available and budget constraints. OBJECTIVE: To this end, we compared the performance of the deep learning, Bidirectional Encoder Representations from Transformers (BERT)-based model, Bio-Clinical-BERT, with a bag-of-words (BOW) logistic regression (LR) model incorporating term frequency-inverse document frequency (TF-IDF). These choices represent different levels of computational requirements. METHODS: A retrospective analysis was conducted using data from 1,391,988 patients who visited emergency departments in the Mount Sinai Health System spanning from 2017 to 2022. The models were trained on 4 hospitals' data and externally validated on a fifth hospital's data. RESULTS: The Bio-Clinical-BERT model achieved higher areas under the receiver operating characteristic curve (0.82, 0.84, and 0.85) compared to the BOW-LR-TF-IDF model (0.81, 0.83, and 0.84) across training sets of 10,000; 100,000; and ~1,000,000 patients, respectively. Notably, both models proved effective at using triage notes for prediction, despite the modest performance gap. CONCLUSIONS: Our findings suggest that simpler machine learning models such as BOW-LR-TF-IDF could serve adequately in resource-limited settings. Given the potential implications for patient care and hospital resource management, further exploration of alternative models and techniques is warranted to enhance predictive performance in this critical domain. INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID): RR2-10.1101/2023.08.07.23293699.

7.

Computer-aided diagnosis of eyelid skin tumors using machine learning.

Zloto, Ofira; Fogel, Ofir; Ben Simon, Guy; Rosner, Mordechai; Vishnevskia-Dai, Vicktoria; Hostovsky, Avner; Klang, Eyal.

Can J Ophthalmol ; 2024 Aug 28.

Article in English | MEDLINE | ID: mdl-39214151

ABSTRACT

OBJECTIVE: To develop an automated, new framework based on machine learning to diagnose malignant eyelid skin tumors. METHODS: This study used eyelid lesion images from Sheba Medical Center, a large tertiary center in Israel. Before model training, we pretrained our models on the International Skin Imaging Collaboration (ISIC) 2019 dataset consisting of 25,332 images. The proprietary eyelid data set was then used for fine-tuning. The data set contained multiple images per patient, aiming to classify malignant lesions in comparison to benign counterparts. RESULTS: The analyzed data set consisted of images representing both benign and malignant eyelid lesions. For the benign category, a total of 373 images were sourced. By comparison, for the malignant category, 186 images were sourced. For the final model, at sensitivity of 93.8% (95% CI 80.0-100.0%), the model has a corresponding specificity of 73.7% (95% CI 60.0-87.1%). To further understand the decision-making process of our model, we employed heatmap visualization techniques, specifically gradient-weighted Class Activation Mapping. DISCUSSION: This study introduces a dependable model-aided diagnostic technology for assessing eyelid skin lesions. The model demonstrated accuracy comparable to human evaluation, effectively determining whether a lesion raises a high suspicion of malignancy or is benign. Such a model has the potential to alleviate the burden on the health care system, particularly benefiting rural areas, and enhancing the efficiency of clinicians and overall health care.

8.

Evaluating prompt engineering on GPT-3.5's performance in USMLE-style medical calculations and clinical scenarios generated by GPT-4.

Patel, Dhavalkumar; Raut, Ganesh; Zimlichman, Eyal; Cheetirala, Satya Narayan; Nadkarni, Girish N; Glicksberg, Benjamin S; Apakama, Donald U; Bell, Elijah J; Freeman, Robert; Timsina, Prem; Klang, Eyal.

Sci Rep ; 14(1): 17341, 2024 07 28.

Article in English | MEDLINE | ID: mdl-39069520

ABSTRACT

This study was designed to assess how different prompt engineering techniques, specifically direct prompts, Chain of Thought (CoT), and a modified CoT approach, influence the ability of GPT-3.5 to answer clinical and calculation-based medical questions, particularly those styled like the USMLE Step 1 exams. To achieve this, we analyzed the responses of GPT-3.5 to two distinct sets of questions: a batch of 1000 questions generated by GPT-4, and another set comprising 95 real USMLE Step 1 questions. These questions spanned a range of medical calculations and clinical scenarios across various fields and difficulty levels. Our analysis revealed that there were no significant differences in the accuracy of GPT-3.5's responses when using direct prompts, CoT, or modified CoT methods. For instance, in the USMLE sample, the success rates were 61.7% for direct prompts, 62.8% for CoT, and 57.4% for modified CoT, with a p-value of 0.734. Similar trends were observed in the responses to GPT-4 generated questions, both clinical and calculation-based, with p-values above 0.05 indicating no significant difference between the prompt types. The conclusion drawn from this study is that the use of CoT prompt engineering does not significantly alter GPT-3.5's effectiveness in handling medical calculations or clinical scenario questions styled like those in USMLE exams. This finding is crucial as it suggests that performance of ChatGPT remains consistent regardless of whether a CoT technique is used instead of direct prompts. This consistency could be instrumental in simplifying the integration of AI tools like ChatGPT into medical education, enabling healthcare professionals to utilize these tools with ease, without the necessity for complex prompt engineering.

Subject(s)

Educational Measurement , Humans , Educational Measurement/methods , Licensure, Medical , Clinical Competence , United States , Education, Medical, Undergraduate/methods

9.

Responses From ChatGPT-4 Show Limited Correlation With Expert Consensus Statement on Anterior Shoulder Instability.

Artamonov, Alexander; Bachar-Avnieli, Ira; Klang, Eyal; Lubovsky, Omri; Atoun, Ehud; Bermant, Alexander; Rosinsky, Philip J.

Arthrosc Sports Med Rehabil ; 6(3): 100923, 2024 Jun.

Article in English | MEDLINE | ID: mdl-39006799

ABSTRACT

Purpose: To compare the similarity of answers provided by Generative Pretrained Transformer-4 (GPT-4) with those of a consensus statement on diagnosis, nonoperative management, and Bankart repair in anterior shoulder instability (ASI). Methods: An expert consensus statement on ASI published by Hurley et al. in 2022 was reviewed and questions laid out to the expert panel were extracted. GPT-4, the subscription version of ChatGPT, was queried using the same set of questions. Answers provided by GPT-4 were compared with those of the expert panel and subjectively rated for similarity by 2 experienced shoulder surgeons. GPT-4 was then used to rate the similarity of its own responses to the consensus statement, classifying them as low, medium, or high. Rates of similarity as classified by the shoulder surgeons and GPT-4 were then compared and interobserver reliability calculated using weighted κ scores. Results: The degree of similarity between responses of GPT-4 and the ASI consensus statement, as defined by shoulder surgeons, was high in 25.8%, medium in 45.2%, and low 29% of questions. GPT-4 assessed similarity as high in 48.3%, medium in 41.9%, and low 9.7% of questions. Surgeons and GPT-4 reached consensus on the classification of 18 questions (58.1%) and disagreement on 13 questions (41.9%). Conclusions: The responses generated by artificial intelligence exhibit limited correlation with an expert statement on the diagnosis and treatment of ASI. Clinical Relevance: As the use of artificial intelligence becomes more prevalent, it is important to understand how closely information resembles content produced by human authors.

10.

Applications of large language models in psychiatry: a systematic review.

Omar, Mahmud; Soffer, Shelly; Charney, Alexander W; Landi, Isotta; Nadkarni, Girish N; Klang, Eyal.

Front Psychiatry ; 15: 1422807, 2024.

Article in English | MEDLINE | ID: mdl-38979501

ABSTRACT

Background: With their unmatched ability to interpret and engage with human language and context, large language models (LLMs) hint at the potential to bridge AI and human cognitive processes. This review explores the current application of LLMs, such as ChatGPT, in the field of psychiatry. Methods: We followed PRISMA guidelines and searched through PubMed, Embase, Web of Science, and Scopus, up until March 2024. Results: From 771 retrieved articles, we included 16 that directly examine LLMs' use in psychiatry. LLMs, particularly ChatGPT and GPT-4, showed diverse applications in clinical reasoning, social media, and education within psychiatry. They can assist in diagnosing mental health issues, managing depression, evaluating suicide risk, and supporting education in the field. However, our review also points out their limitations, such as difficulties with complex cases and potential underestimation of suicide risks. Conclusion: Early research in psychiatry reveals LLMs' versatile applications, from diagnostic support to educational roles. Given the rapid pace of advancement, future investigations are poised to explore the extent to which these models might redefine traditional roles in mental health care.

11.

Duration-Dependent Risk of Hypoxemia in Colonoscopy Procedures.

Klang, Eyal; Sharif, Kassem; Ukashi, Offir; Rahman, Nisim; Lahat, Adi.

J Clin Med ; 13(13)2024 Jun 24.

Article in English | MEDLINE | ID: mdl-38999246

ABSTRACT

Background and Aims: Colonoscopy is a critical diagnostic and therapeutic procedure in gastroenterology. However, it carries risks, including hypoxemia, which can impact patient safety. Understanding the factors that contribute to the incidence of severe hypoxemia, specifically the role of procedure duration, is essential for improving patient outcomes. This study aims to elucidate the relationship between the length of colonoscopy procedures and the occurrence of severe hypoxemia. Methods: We conducted a retrospective cohort study at Sheba Medical Center, Israel, including 21,524 adult patients who underwent colonoscopy from January 2020 to January 2024. The study focused on the incidence of severe hypoxemia, defined as a drop in oxygen saturation below 90%. Sedation protocols, involving a combination of Fentanyl, Midazolam, and Propofol were personalized based on the endoscopist's discretion. Data were collected from electronic health records, covering patient demographics, clinical scores, sedation and procedure details, and outcomes. Statistical analyses, including logistic regression, were used to examine the association between procedure duration and hypoxemia, adjusting for various patient and procedural factors. Results: We initially collected records of 26,569 patients who underwent colonoscopy, excluding 5045 due to incomplete data, resulting in a final cohort of 21,524 patients. Procedures under 20 min comprised 48.9% of the total, while those lasting 20-40 min made up 50.7%. Only 8.5% lasted 40-60 min, and 2.9% exceeded 60 min. Longer procedures correlated with higher hypoxemia risk: 17.3% for <20 min, 24.2% for 20-40 min, 32.4% for 40-60 min, and 36.1% for ≥60 min. Patients aged 60-80 and ≥80 had increased hypoxemia odds (aOR 1.1, 95% CI 1.0-1.2 and aOR 1.2, 95% CI 1.0-1.4, respectively). Procedure durations of 20-40 min, 40-60 min, and over 60 min had aORs of 1.5 (95% CI 1.4-1.6), 2.1 (95% CI 1.9-2.4), and 2.4 (95% CI 2.0-3.0), respectively. Conclusions: The duration of colonoscopy procedures significantly impacts the risk of severe hypoxemia, with longer durations associated with higher risks. This study underscores the importance of optimizing procedural efficiency and tailoring sedation protocols to individual patient risk profiles to enhance the safety of colonoscopy. Further research is needed to develop strategies that minimize procedure duration without compromising the quality of care, thereby reducing the risk of hypoxemia and improving patient safety.

12.

Machine learning in cardiac stress test interpretation: a systematic review.

Hadida Barzilai, Dor; Cohen-Shelly, Michal; Sorin, Vera; Zimlichman, Eyal; Massalha, Eias; Allison, Thomas G; Klang, Eyal.

Eur Heart J Digit Health ; 5(4): 401-408, 2024 Jul.

Article in English | MEDLINE | ID: mdl-39081945

ABSTRACT

Coronary artery disease (CAD) is a leading health challenge worldwide. Exercise stress testing is a foundational non-invasive diagnostic tool. Nonetheless, its variable accuracy prompts the exploration of more reliable methods. Recent advancements in machine learning (ML), including deep learning and natural language processing, have shown potential in refining the interpretation of stress testing data. Adhering to Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines, we conducted a systematic review of ML applications in stress electrocardiogram (ECG) and stress echocardiography for CAD prognosis. Medical Literature Analysis and Retrieval System Online, Web of Science, and the Cochrane Library were used as databases. We analysed the ML models, outcomes, and performance metrics. Overall, seven relevant studies were identified. Machine-learning applications in stress ECGs resulted in sensitivity and specificity improvements. Some models achieved rates of above 96% in both metrics and reduced false positives by up to 21%. In stress echocardiography, ML models demonstrated an increase in diagnostic precision. Some models achieved specificity and sensitivity rates of up to 92.7 and 84.4%, respectively. Natural language processing applications enabled the categorization of stress echocardiography reports, with accuracy rates nearing 98%. Limitations include a small, retrospective study pool and the exclusion of nuclear stress testing, due to its well-documented status. This review indicates the potential of artificial intelligence applications in refining CAD stress testing assessment. Further development for real-world use is warranted.

13.

Utilizing ChatGPT to Facilitate Referrals for Fetal Echocardiography.

Gordin Kopylov, Lital; Goldrat, Itai; Maymon, Ron; Svirsky, Ran; Wiener, Yifat; Klang, Eyal.

Fetal Diagn Ther ; : 1-4, 2024 Jun 04.

Article in English | MEDLINE | ID: mdl-38834046

ABSTRACT

INTRODUCTION: OpenAI's GPT-4 (artificial intelligence [AI]) is being studied for its use as a medical decision support tool. This research examines its accuracy in refining referrals for fetal echocardiography (FE) to improve early detection and outcomes related to congenital heart defects (CHDs). METHODS: Past FE data referred to our institution were evaluated separately by pediatric cardiologist, gynecologist (human experts [experts]), and AI, according to established guidelines. We compared experts and AI's agreement on referral necessity, with experts addressing discrepancies. RESULTS: Total of 59 FE cases were addressed retrospectively. Cardiologist, gynecologist, and AI recommended performing FE in 47.5%, 49.2%, and 59.0% of cases, respectively. Comparing AI recommendations to experts indicated agreement of around 80.0% with both experts (p < 0.001). Notably, AI suggested more echocardiographies for minor CHD (64.7%) compared to experts (47.1%), and for major CHD, experts recommended performing FE in all cases (100%) while AI recommended in majority of cases (90.9%). Discrepancies between AI and experts are detailed and reviewed. CONCLUSIONS: The evaluation found moderate agreement between AI and experts. Contextual misunderstandings and lack of specialized medical knowledge limit AI, necessitating clinical guideline guidance. Despite shortcomings, AI's referrals comprised 65% of minor CHD cases versus experts 47%, suggesting its potential as a cautious decision aid for clinicians.

14.

Assessing Generative Pretrained Transformers (GPT) in Clinical Decision-Making: Comparative Analysis of GPT-3.5 and GPT-4.

Lahat, Adi; Sharif, Kassem; Zoabi, Narmin; Shneor Patt, Yonatan; Sharif, Yousra; Fisher, Lior; Shani, Uria; Arow, Mohamad; Levin, Roni; Klang, Eyal.

J Med Internet Res ; 26: e54571, 2024 Jun 27.

Article in English | MEDLINE | ID: mdl-38935937

ABSTRACT

BACKGROUND: Artificial intelligence, particularly chatbot systems, is becoming an instrumental tool in health care, aiding clinical decision-making and patient engagement. OBJECTIVE: This study aims to analyze the performance of ChatGPT-3.5 and ChatGPT-4 in addressing complex clinical and ethical dilemmas, and to illustrate their potential role in health care decision-making while comparing seniors' and residents' ratings, and specific question types. METHODS: A total of 4 specialized physicians formulated 176 real-world clinical questions. A total of 8 senior physicians and residents assessed responses from GPT-3.5 and GPT-4 on a 1-5 scale across 5 categories: accuracy, relevance, clarity, utility, and comprehensiveness. Evaluations were conducted within internal medicine, emergency medicine, and ethics. Comparisons were made globally, between seniors and residents, and across classifications. RESULTS: Both GPT models received high mean scores (4.4, SD 0.8 for GPT-4 and 4.1, SD 1.0 for GPT-3.5). GPT-4 outperformed GPT-3.5 across all rating dimensions, with seniors consistently rating responses higher than residents for both models. Specifically, seniors rated GPT-4 as more beneficial and complete (mean 4.6 vs 4.0 and 4.6 vs 4.1, respectively; P<.001), and GPT-3.5 similarly (mean 4.1 vs 3.7 and 3.9 vs 3.5, respectively; P<.001). Ethical queries received the highest ratings for both models, with mean scores reflecting consistency across accuracy and completeness criteria. Distinctions among question types were significant, particularly for the GPT-4 mean scores in completeness across emergency, internal, and ethical questions (4.2, SD 1.0; 4.3, SD 0.8; and 4.5, SD 0.7, respectively; P<.001), and for GPT-3.5's accuracy, beneficial, and completeness dimensions. CONCLUSIONS: ChatGPT's potential to assist physicians with medical issues is promising, with prospects to enhance diagnostics, treatments, and ethics. While integration into clinical workflows may be valuable, it must complement, not replace, human expertise. Continued research is essential to ensure safe and effective implementation in clinical environments.

Subject(s)

Clinical Decision-Making , Humans , Artificial Intelligence

15.

Deep learning in magnetic resonance enterography for Crohn's disease assessment: a systematic review.

Brem, Ofir; Elisha, David; Konen, Eli; Amitai, Michal; Klang, Eyal.

Abdom Radiol (NY) ; 49(9): 3183-3189, 2024 Sep.

Article in English | MEDLINE | ID: mdl-38693270

ABSTRACT

Crohn's disease (CD) poses significant morbidity, underscoring the need for effective, non-invasive inflammatory assessment using magnetic resonance enterography (MRE). This literature review evaluates recent publications on the role of deep learning in improving MRE for CD assessment. We searched MEDLINE/PUBMED for studies that reported the use of deep learning algorithms for assessment of CD activity. The study was conducted according to the PRISMA guidelines. The risk of bias was evaluated using the QUADAS-2 tool. Five eligible studies, encompassing 468 subjects, were identified. Our study suggests that diverse deep learning applications, including image quality enhancement, bowel segmentation for disease burden quantification, and 3D reconstruction for surgical planning are useful and promising for CD assessment. However, most of the studies are preliminary, retrospective studies, and have a high risk of bias in at least one category. Future research is needed to assess how deep learning can impact CD patient diagnostics, particularly when considering the increasing integration of such models into hospital systems.

Subject(s)

Crohn Disease , Deep Learning , Magnetic Resonance Imaging , Humans , Crohn Disease/diagnostic imaging , Magnetic Resonance Imaging/methods , Image Interpretation, Computer-Assisted/methods

16.

Evaluating the accuracy of a state-of-the-art large language model for prediction of admissions from the emergency room.

Glicksberg, Benjamin S; Timsina, Prem; Patel, Dhaval; Sawant, Ashwin; Vaid, Akhil; Raut, Ganesh; Charney, Alexander W; Apakama, Donald; Carr, Brendan G; Freeman, Robert; Nadkarni, Girish N; Klang, Eyal.

J Am Med Inform Assoc ; 31(9): 1921-1928, 2024 Sep 01.

Article in English | MEDLINE | ID: mdl-38771093

ABSTRACT

BACKGROUND: Artificial intelligence (AI) and large language models (LLMs) can play a critical role in emergency room operations by augmenting decision-making about patient admission. However, there are no studies for LLMs using real-world data and scenarios, in comparison to and being informed by traditional supervised machine learning (ML) models. We evaluated the performance of GPT-4 for predicting patient admissions from emergency department (ED) visits. We compared performance to traditional ML models both naively and when informed by few-shot examples and/or numerical probabilities. METHODS: We conducted a retrospective study using electronic health records across 7 NYC hospitals. We trained Bio-Clinical-BERT and XGBoost (XGB) models on unstructured and structured data, respectively, and created an ensemble model reflecting ML performance. We then assessed GPT-4 capabilities in many scenarios: through Zero-shot, Few-shot with and without retrieval-augmented generation (RAG), and with and without ML numerical probabilities. RESULTS: The Ensemble ML model achieved an area under the receiver operating characteristic curve (AUC) of 0.88, an area under the precision-recall curve (AUPRC) of 0.72 and an accuracy of 82.9%. The naïve GPT-4's performance (0.79 AUC, 0.48 AUPRC, and 77.5% accuracy) showed substantial improvement when given limited, relevant data to learn from (ie, RAG) and underlying ML probabilities (0.87 AUC, 0.71 AUPRC, and 83.1% accuracy). Interestingly, RAG alone boosted performance to near peak levels (0.82 AUC, 0.56 AUPRC, and 81.3% accuracy). CONCLUSIONS: The naïve LLM had limited performance but showed significant improvement in predicting ED admissions when supplemented with real-world examples to learn from, particularly through RAG, and/or numerical probabilities from traditional ML models. Its peak performance, although slightly lower than the pure ML model, is noteworthy given its potential for providing reasoning behind predictions. Further refinement of LLMs with real-world data is necessary for successful integration as decision-support tools in care settings.

Subject(s)

Electronic Health Records , Emergency Service, Hospital , Patient Admission , Humans , Retrospective Studies , Artificial Intelligence , Natural Language Processing , Machine Learning , Supervised Machine Learning

17.

Obesity Is Associated with Fatty Liver and Fat Changes in the Kidneys in Humans as Assessed by MRI.

Raphael, Hadar; Klang, Eyal; Konen, Eli; Inbar, Yael; Leibowitz, Avshalom; Frenkel-Nir, Yael; Apter, Sara; Grossman, Ehud.

Nutrients ; 16(9)2024 May 03.

Article in English | MEDLINE | ID: mdl-38732633

ABSTRACT

BACKGROUND: Obesity is associated with metabolic syndrome and fat accumulation in various organs such as the liver and the kidneys. Our goal was to assess, using magnetic resonance imaging (MRI) Dual-Echo phase sequencing, the association between liver and kidney fat deposition and their relation to obesity. METHODS: We analyzed MRI scans of individuals who were referred to the Chaim Sheba Medical Center between December 2017 and May 2020 to perform a study for any indication. For each individual, we retrieved from the computerized charts data on sex, and age, weight, height, body mass index (BMI), systolic and diastolic blood pressure (BP), and comorbidities (diabetes mellitus, hypertension, dyslipidemia). RESULTS: We screened MRI studies of 399 subjects with a median age of 51 years, 52.4% of whom were women, and a median BMI 24.6 kg/m2. We diagnosed 18% of the participants with fatty liver and 18.6% with fat accumulation in the kidneys (fatty kidneys). Out of the 67 patients with fatty livers, 23 (34.3%) also had fatty kidneys, whereas among the 315 patients without fatty livers, only 48 patients (15.2%) had fatty kidneys (p < 0.01). In comparison to the patients who did not have a fatty liver or fatty kidneys (n = 267), those who had both (n = 23) were more obese, had higher systolic BP, and were more likely to have diabetes mellitus. In comparison to the patients without a fatty liver, those with fatty livers had an adjusted odds ratio of 2.91 (97.5% CI; 1.61-5.25) to have fatty kidneys. In total, 19.6% of the individuals were obese (BMI ≥ 30), and 26.1% had overweight (25 < BMI < 30). The obese and overweight individuals were older and more likely to have diabetes mellitus and hypertension and had higher rates of fatty livers and fatty kidneys. Fat deposition in both the liver and the kidneys was observed in 15.9% of the obese patients, in 8.3% of the overweight patients, and in none of those with normal weight. Obesity was the only risk factor for fatty kidneys and fatty livers, with an adjusted OR of 6.3 (97.5% CI 2.1-18.6). CONCLUSIONS: Obesity is a major risk factor for developing a fatty liver and fatty kidneys. Individuals with a fatty liver are more likely to have fatty kidneys. MRI is an accurate modality for diagnosing fatty kidneys. Reviewing MRI scans of any indication should include assessment of fat fractions in the kidneys in addition to that of the liver.

Subject(s)

Fatty Liver , Kidney , Magnetic Resonance Imaging , Obesity , Humans , Female , Male , Middle Aged , Obesity/complications , Kidney/diagnostic imaging , Kidney/physiopathology , Adult , Fatty Liver/diagnostic imaging , Fatty Liver/epidemiology , Body Mass Index , Liver/diagnostic imaging , Liver/pathology , Kidney Diseases/diagnostic imaging , Kidney Diseases/epidemiology , Aged , Risk Factors

18.

Utilizing natural language processing and large language models in the diagnosis and prediction of infectious diseases: A systematic review.

Omar, Mahmud; Brin, Dana; Glicksberg, Benjamin; Klang, Eyal.

Am J Infect Control ; 52(9): 992-1001, 2024 Sep.

Article in English | MEDLINE | ID: mdl-38588980

ABSTRACT

BACKGROUND: Natural Language Processing (NLP) and Large Language Models (LLMs) hold largely untapped potential in infectious disease management. This review explores their current use and uncovers areas needing more attention. METHODS: This analysis followed systematic review procedures, registered with the Prospective Register of Systematic Reviews. We conducted a search across major databases including PubMed, Embase, Web of Science, and Scopus, up to December 2023, using keywords related to NLP, LLM, and infectious diseases. We also employed the Quality Assessment of Diagnostic Accuracy Studies-2 tool for evaluating the quality and robustness of the included studies. RESULTS: Our review identified 15 studies with diverse applications of NLP in infectious disease management. Notable examples include GPT-4's application in detecting urinary tract infections and BERTweet's use in Lyme Disease surveillance through social media analysis. These models demonstrated effective disease monitoring and public health tracking capabilities. However, the effectiveness varied across studies. For instance, while some NLP tools showed high accuracy in pneumonia detection and high sensitivity in identifying invasive mold diseases from medical reports, others fell short in areas like bloodstream infection management. CONCLUSIONS: This review highlights the yet-to-be-fully-realized promise of NLP and LLMs in infectious disease management. It calls for more exploration to fully harness AI's capabilities, particularly in the areas of diagnosis, surveillance, predicting disease courses, and tracking epidemiological trends.

Subject(s)

Communicable Diseases , Natural Language Processing , Humans , Communicable Diseases/diagnosis

19.

ChatGPT's adherence to otolaryngology clinical practice guidelines.

Tessler, Idit; Wolfovitz, Amit; Alon, Eran E; Gecel, Nir A; Livneh, Nir; Zimlichman, Eyal; Klang, Eyal.

Eur Arch Otorhinolaryngol ; 281(7): 3829-3834, 2024 Jul.

Article in English | MEDLINE | ID: mdl-38647684

ABSTRACT

OBJECTIVES: Large language models, including ChatGPT, has the potential to transform the way we approach medical knowledge, yet accuracy in clinical topics is critical. Here we assessed ChatGPT's performance in adhering to the American Academy of Otolaryngology-Head and Neck Surgery guidelines. METHODS: We presented ChatGPT with 24 clinical otolaryngology questions based on the guidelines of the American Academy of Otolaryngology. This was done three times (N = 72) to test the model's consistency. Two otolaryngologists evaluated the responses for accuracy and relevance to the guidelines. Cohen's Kappa was used to measure evaluator agreement, and Cronbach's alpha assessed the consistency of ChatGPT's responses. RESULTS: The study revealed mixed results; 59.7% (43/72) of ChatGPT's responses were highly accurate, while only 2.8% (2/72) directly contradicted the guidelines. The model showed 100% accuracy in Head and Neck, but lower accuracy in Rhinology and Otology/Neurotology (66%), Laryngology (50%), and Pediatrics (8%). The model's responses were consistent in 17/24 (70.8%), with a Cronbach's alpha value of 0.87, indicating a reasonable consistency across tests. CONCLUSIONS: Using a guideline-based set of structured questions, ChatGPT demonstrates consistency but variable accuracy in otolaryngology. Its lower performance in some areas, especially Pediatrics, suggests that further rigorous evaluation is needed before considering real-world clinical use.

Subject(s)

Guideline Adherence , Otolaryngology , Practice Guidelines as Topic , Otolaryngology/standards , Humans , United States

20.

Navigating the vestibular maze: text-mining analysis of publication trends over five decades.

Wolfovitz, Amit; Gecel, Nir A; Gimmon, Yoav; Shivatzki, Shaked; Sorin, Vera; Barash, Yiftach; Klang, Eyal; Tessler, Idit.

Front Neurol ; 15: 1292640, 2024.

Article in English | MEDLINE | ID: mdl-38560730

ABSTRACT

Introduction: The field of vestibular science, encompassing the study of the vestibular system and associated disorders, has experienced notable growth and evolving trends over the past five decades. Here, we explore the changing landscape in vestibular science, focusing on epidemiology, peripheral pathologies, diagnosis methods, treatment, and technological advancements. Methods: Publication data was obtained from the US National Center for Biotechnology Information (NCBI) PubMed database. The analysis included epidemiological, etiological, diagnostic, and treatment-focused studies on peripheral vestibular disorders, with a particular emphasis on changes in topics and trends of publications over time. Results: Our dataset of 39,238 publications revealed a rising trend in research across all age groups. Etiologically, benign paroxysmal positional vertigo (BPPV) and Meniere's disease were the most researched conditions, but the prevalence of studies on vestibular migraine showed a marked increase in recent years. Electronystagmography (ENG)/ Videonystagmography (VNG) and Vestibular Evoked Myogenic Potential (VEMP) were the most commonly discussed diagnostic tools, while physiotherapy stood out as the primary treatment modality. Conclusion: Our study presents a unique opportunity and point of view, exploring the evolving landscape of vestibular science publications over the past five decades. The analysis underscored the dynamic nature of the field, highlighting shifts in focus and emerging publication trends in diagnosis and treatment over time.

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL