Results 1 - 20 of 4,638
1.
J Med Internet Res ; 26: e50236, 2024 Aug 01.
Article in English | MEDLINE | ID: mdl-39088259

ABSTRACT

BACKGROUND: Patients increasingly rely on web-based physician reviews to choose a physician and share their experiences. However, the unstructured text of these written reviews presents a challenge for researchers seeking to make inferences about patients' judgments. Methods previously used to identify patient judgments within reviews, such as hand-coding and dictionary-based approaches, have posed limitations to sample size and classification accuracy. Advanced natural language processing methods can help overcome these limitations and promote further analysis of physician reviews on these popular platforms. OBJECTIVE: This study aims to train, test, and validate an advanced natural language processing algorithm for classifying the presence and valence of 2 dimensions of patient judgments in web-based physician reviews: interpersonal manner and technical competence. METHODS: We sampled 345,053 reviews for 167,150 physicians across the United States from Healthgrades.com, a commercial web-based physician rating and review website. We hand-coded 2000 written reviews and used those reviews to train and test a transformer classification algorithm called the Robustly Optimized BERT (Bidirectional Encoder Representations from Transformers) Pretraining Approach (RoBERTa). The 2 fine-tuned models coded the reviews for the presence and positive or negative valence of patients' interpersonal manner or technical competence judgments of their physicians. We evaluated the performance of the 2 models against 200 hand-coded reviews and validated the models using the full sample of 345,053 RoBERTa-coded reviews. RESULTS: The interpersonal manner model was 90% accurate with precision of 0.89, recall of 0.90, and weighted F1-score of 0.89. The technical competence model was 90% accurate with precision of 0.91, recall of 0.90, and weighted F1-score of 0.90. Positive-valence judgments were associated with higher review star ratings whereas negative-valence judgments were associated with lower star ratings. Analysis of the data by review rating and physician gender corresponded with findings in prior literature. CONCLUSIONS: Our 2 classification models coded interpersonal manner and technical competence judgments with high precision, recall, and accuracy. These models were validated using review star ratings and results from previous research. RoBERTa can accurately classify unstructured, web-based review text at scale. Future work could explore the use of this algorithm with other textual data, such as social media posts and electronic health records.
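
As an illustration of the fine-tuning workflow this abstract describes, a minimal Hugging Face sketch follows. The toy reviews, the three-class label scheme (absent/positive/negative judgment), and the hyperparameters are assumptions for demonstration only and are not taken from the study.

```python
# Illustrative sketch (not the authors' code): fine-tuning RoBERTa to flag whether a
# review contains an interpersonal-manner judgment and its valence.
from datasets import Dataset
from transformers import (RobertaTokenizerFast, RobertaForSequenceClassification,
                          TrainingArguments, Trainer)

# Hypothetical hand-coded examples: 0 = absent, 1 = positive, 2 = negative.
train = Dataset.from_dict({
    "text": ["Dr. Smith listened carefully and was very kind.",
             "The visit was fine but billing took forever.",
             "He was dismissive and rushed me out of the room."],
    "label": [1, 0, 2],
})

tok = RobertaTokenizerFast.from_pretrained("roberta-base")
train = train.map(lambda batch: tok(batch["text"], truncation=True,
                                    padding="max_length", max_length=128),
                  batched=True)

model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=3)
args = TrainingArguments(output_dir="manner_model", num_train_epochs=3,
                         per_device_train_batch_size=8, logging_steps=10)
Trainer(model=model, args=args, train_dataset=train).train()
```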


Subjects
Algorithms; Internet; Natural Language Processing; Humans; Female; Male; Physicians; Physician-Patient Relations; Judgment; Adult; Middle Aged
2.
Open Res Eur ; 4: 110, 2024.
Article in English | MEDLINE | ID: mdl-39091348

ABSTRACT

Large Language Models (LLMs) offer advanced text generation capabilities, sometimes surpassing human abilities. However, their use without proper expertise poses significant challenges, particularly in educational contexts. This article explores different facets of natural language generation (NLG) within the educational realm, assessing its advantages and disadvantages, particularly concerning LLMs. It addresses concerns regarding the opacity of LLMs and the potential bias in their generated content, advocating for transparent solutions. Therefore, it examines the feasibility of integrating OpenLogos expert-crafted resources into language generation tools used for paraphrasing and translation. In the context of the Multi3Generation COST Action (CA18231), we have been emphasizing the significance of incorporating OpenLogos into language generation processes, and the need for clear guidelines and ethical standards in generative models involving multilingual, multimodal, and multitasking capabilities. The Multi3Generation initiative strives to progress NLG research for societal welfare, including its educational applications. It promotes inclusive models inspired by the Logos Model, prioritizing transparency, human control, preservation of language principles and meaning, and acknowledgment of the expertise of resource creators. We envision a scenario where OpenLogos can contribute significantly to inclusive AI-supported education. Ethical considerations and limitations related to AI implementation in education are explored, highlighting the importance of maintaining a balanced approach consistent with traditional educational principles. Ultimately, the article advocates for educators to adopt innovative tools and methodologies to foster dynamic learning environments that facilitate linguistic development and growth.


Large Language Models boast advanced text generation quality and capabilities, often surpassing those of humans. However, they also pose significant challenges when used without proper expertise or care. In an educational context, the examination of language generation tools and their use by students is vital for establishing guidelines and a shared understanding of their ethical usage. This article explores several aspects of language generation within an educational context, and showcases the potential use of OpenLogos resources, provided within the framework of the Multi3Generation COST Action (CA18231) in language study and their integration into language learning tools, such as paraphrasing (monolingual) and translation (bilingual or multilingual). This article emphasizes the importance of leveraging OpenLogos in education, especially in language learning or language enhancement contexts. By embracing innovative tools and methodologies, educators can nurture a dynamic and enriching learning environment conducive to linguistic growth and development.

3.
J Med Internet Res ; 26: e60336, 2024 Aug 02.
Article in English | MEDLINE | ID: mdl-39094112

ABSTRACT

BACKGROUND: Discharge instructions are a key form of documentation and patient communication during the transition from the emergency department (ED) to home. Discharge instructions are time-consuming and often underprioritized, especially in the ED, leading to discharge delays and possibly impersonal patient instructions. Generative artificial intelligence and large language models (LLMs) offer promising methods of creating high-quality and personalized discharge instructions; however, there exists a gap in understanding patient perspectives of LLM-generated discharge instructions. OBJECTIVE: We aimed to assess the use of LLMs such as ChatGPT in synthesizing accurate and patient-accessible discharge instructions in the ED. METHODS: We synthesized 5 unique, fictional ED encounters to emulate real ED encounters, including a diverse set of clinician history and physical notes and nursing notes. These were passed to GPT-4 in Azure OpenAI Service (Microsoft) to generate LLM-generated discharge instructions. Standard discharge instructions were also generated for each of the 5 unique ED encounters. All GPT-generated and standard discharge instructions were then formatted into standardized after-visit summary documents. These after-visit summaries containing either GPT-generated or standard discharge instructions were randomly and blindly administered to Amazon MTurk respondents representing patient populations through Amazon MTurk Survey Distribution. Discharge instructions were assessed on metrics of interpretability of significance, understandability, and satisfaction. RESULTS: Survey respondents' perspectives were significantly (P=.01) more favorable toward GPT-generated return precautions, and all other sections were considered noninferior to standard discharge instructions. Among the 156 survey respondents, GPT-generated discharge instructions were assigned favorable ratings ("agree" and "strongly agree") more frequently on the metric of interpretability of significance in the subsections regarding diagnosis, procedures, treatment, post-ED medications or any changes to medications, and return precautions. Survey respondents found GPT-generated instructions to be more understandable when rating procedures, treatment, post-ED medications or medication changes, post-ED follow-up, and return precautions. Satisfaction with GPT-generated discharge instruction subsections was most favorable for procedures, treatment, post-ED medications or medication changes, and return precautions. A Wilcoxon rank-sum test of Likert responses revealed significant differences (P=.01) in the interpretability of significance for return precautions in GPT-generated discharge instructions compared to standard discharge instructions, but not for other evaluation metrics or discharge instruction subsections. CONCLUSIONS: This study demonstrates the potential for LLMs such as ChatGPT to augment current documentation workflows in the ED and reduce the documentation burden on physicians. By improving readability and making instructions more applicable to patients, LLMs could improve upon current methods of patient communication.
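
The abstract's primary statistical comparison is a Wilcoxon rank-sum test on Likert responses. A minimal sketch of that test is shown below with invented ratings; the data, sample size, and group labels are assumptions, not the study's results.

```python
# Illustrative sketch (hypothetical ratings, not the study data): comparing Likert
# responses for GPT-generated vs. standard return precautions with a Wilcoxon
# rank-sum (Mann-Whitney U) test.
from scipy.stats import mannwhitneyu

# 1 = strongly disagree ... 5 = strongly agree
gpt_ratings      = [5, 4, 5, 4, 5, 3, 4, 5, 4, 5]
standard_ratings = [3, 4, 3, 2, 4, 3, 3, 4, 2, 3]

stat, p = mannwhitneyu(gpt_ratings, standard_ratings, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p:.3f}")
```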


Subjects
Emergency Service, Hospital; Patient Discharge; Humans; Emergency Service, Hospital/statistics & numerical data; Patient Discharge/statistics & numerical data; Female; Male; Surveys and Questionnaires; Adult; Middle Aged; Artificial Intelligence
4.
Br J Psychol ; 2024 Aug 02.
Article in English | MEDLINE | ID: mdl-39095975

ABSTRACT

Recent years have witnessed rapid and substantial progress in natural language processing (NLP) techniques used to analyse text data. This study offers an up-to-date review of NLP applications by examining their use in counselling and psychotherapy from 1990 to 2021. The purpose of this scoping review is to identify trends, advancements, challenges, and limitations of these applications. Among the 41 papers included in this review, 4 primary study purposes were identified: (1) developing automated coding; (2) predicting outcomes; (3) monitoring counselling sessions; and (4) investigating language patterns. Our findings showed a growing trend in the number of papers utilizing advanced machine learning methods, particularly neural networks. However, only a third of the articles addressed the issues of bias and generalizability. This review provides a timely systematic update, shedding light on concerns related to bias, generalizability, and validity in the context of NLP applications in counselling and psychotherapy.

5.
Int J Med Inform ; 191: 105580, 2024 Jul 31.
Article in English | MEDLINE | ID: mdl-39096594

ABSTRACT

INTRODUCTION: Radiology scoring systems are critical to the success of lung cancer screening (LCS) programs, impacting patient care, adherence to follow-up, data management and reporting, and program evaluation. The Lung CT Screening Reporting and Data System (Lung-RADS) is a structured radiology scoring system that provides recommendations for LCS follow-up that are utilized (a) in clinical care and (b) by LCS programs monitoring rates of adherence to follow-up. Thus, accurate reporting and reliable collection of Lung-RADS scores are fundamental components of LCS program evaluation and improvement. Unfortunately, due to variability in radiology reports, extraction of Lung-RADS scores is non-trivial, and best practices do not exist. The purpose of this project is to compare mechanisms to extract Lung-RADS scores from free-text radiology reports. METHODS: We retrospectively analyzed reports of LCS low-dose computed tomography (LDCT) examinations performed at a multihospital integrated healthcare network in New York State between January 2016 and July 2023. We compared three methods of Lung-RADS score extraction: manual physician entry at the time of report creation, manual LCS specialist entry after report creation, and an internally developed, rule-based natural language processing (NLP) algorithm. Accuracy, recall, precision, and completeness (i.e., the proportion of LCS exams to which a Lung-RADS score has been assigned) were compared between the three methods. RESULTS: The dataset included 24,060 LCS examinations on 14,243 unique patients. The mean patient age was 65 years, and most patients were male (54%) and white (75%). The completeness rate was 65%, 68%, and 99% for radiologists' manual entry, LCS specialists' entry, and the NLP algorithm, respectively. Accuracy, recall, and precision were high across all extraction methods (>94%), though the NLP-based approach was consistently higher than both manual entries on all metrics. DISCUSSION: An NLP-based method of Lung-RADS score determination is an efficient and more accurate means of extracting Lung-RADS scores than manual review and data entry. NLP-based methods should be considered best practice for extracting structured Lung-RADS scores from free-text radiology reports.
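
The study's internally developed algorithm is not published here, but a rule-based extractor of this kind typically amounts to pattern matching over the report text. The sketch below is a generic, assumed example of that approach; the regex and report snippets are illustrative, not the authors' rules.

```python
# Illustrative sketch of a rule-based extractor (not the authors' pipeline):
# pull a Lung-RADS category out of free-text LDCT report impressions.
import re
from typing import Optional

# Matches variants such as "Lung-RADS 4A", "Lung RADS category 2", "LUNG-RADS: 3".
PATTERN = re.compile(r"lung[\s-]?rads\s*(?:category|score)?[\s:]*([0-4][abxs]?)",
                     re.IGNORECASE)

def extract_lung_rads(report_text: str) -> Optional[str]:
    """Return the first Lung-RADS category found in a report, or None."""
    match = PATTERN.search(report_text)
    return match.group(1).upper() if match else None

print(extract_lung_rads("IMPRESSION: 6 mm solid nodule. Lung-RADS category 4A."))  # 4A
print(extract_lung_rads("No suspicious findings. LUNG-RADS: 2."))                  # 2
```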

6.
Neuron ; 2024 Jul 18.
Article in English | MEDLINE | ID: mdl-39096896

ABSTRACT

Effective communication hinges on a mutual understanding of word meaning in different contexts. We recorded brain activity using electrocorticography during spontaneous, face-to-face conversations in five pairs of epilepsy patients. We developed a model-based coupling framework that aligns brain activity in both speaker and listener to a shared embedding space from a large language model (LLM). The context-sensitive LLM embeddings allow us to track the exchange of linguistic information, word by word, from one brain to another in natural conversations. Linguistic content emerges in the speaker's brain before word articulation and rapidly re-emerges in the listener's brain after word articulation. The contextual embeddings better capture word-by-word neural alignment between speaker and listener than syntactic and articulatory models. Our findings indicate that the contextual embeddings learned by LLMs can serve as an explicit numerical model of the shared, context-rich meaning space humans use to communicate their thoughts to one another.
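
The coupling framework itself requires ECoG recordings, but its core ingredient, a linear encoding model that maps contextual LLM embeddings to neural activity, can be sketched generically. The example below uses simulated data and ridge regression purely for illustration under those assumptions; it is not the authors' pipeline.

```python
# Illustrative sketch only (simulated data, not the study's ECoG analysis): fit a
# linear encoding model from contextual embeddings to per-word neural features and
# score it by prediction correlation.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_words, emb_dim, n_electrodes = 500, 768, 64

X = rng.normal(size=(n_words, emb_dim))            # stand-in for contextual embeddings
W = rng.normal(size=(emb_dim, n_electrodes))
Y = X @ W + rng.normal(scale=5.0, size=(n_words, n_electrodes))  # simulated neural data

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)
encoder = Ridge(alpha=10.0).fit(X_tr, Y_tr)
pred = encoder.predict(X_te)

# Encoding performance: correlation per electrode, then averaged.
corrs = [np.corrcoef(pred[:, e], Y_te[:, e])[0, 1] for e in range(n_electrodes)]
print(f"mean prediction correlation: {np.mean(corrs):.2f}")
```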

7.
Farm Hosp ; 48 Suppl 1: S35-S44, 2024 Jul.
Article in English, Spanish | MEDLINE | ID: mdl-39097366

ABSTRACT

Artificial intelligence (AI) is a broad concept that includes the study of the ability of computers to perform tasks that would normally require the intervention of human intelligence. By exploiting large volumes of healthcare data, artificial intelligence algorithms can identify patterns and predict outcomes, which can help healthcare organizations and their professionals make better decisions and achieve better results. Machine learning, deep learning, neural networks or natural language processing are among the most important methods, allowing systems to learn and improve from data without the need for explicit programming. AI has been introduced in biomedicine, accelerating processes, improving safety and efficiency, and improving patient care. By using AI algorithms and Machine Learning, hospital pharmacists can analyze a large volume of patient data, including medical records, laboratory results, and medication profiles, aiding them in identifying potential drug-drug interactions, assessing the safety and efficacy of medicines, and making informed recommendations. AI integration will improve the quality of pharmaceutical care, optimize processes, promote research, deploy open innovation, and facilitate education. Hospital pharmacists who master AI will play a crucial role in this transformation.


Subjects
Artificial Intelligence; Pharmacy Service, Hospital; Pharmacy Service, Hospital/organization & administration; Humans; Pharmacists; Algorithms; Machine Learning; Neural Networks, Computer
8.
Farm Hosp ; 48 Suppl 1: TS35-TS44, 2024 Jul.
Article in English, Spanish | MEDLINE | ID: mdl-39097375

ABSTRACT

Artificial intelligence is a broad concept that includes the study of the ability of computers to perform tasks that would normally require the intervention of human intelligence. By exploiting large volumes of healthcare data, Artificial intelligence algorithms can identify patterns and predict outcomes, which can help healthcare organizations and their professionals make better decisions and achieve better results. Machine learning, deep learning, neural networks, or natural language processing are among the most important methods, allowing systems to learn and improve from data without the need for explicit programming. Artificial intelligence has been introduced in biomedicine, accelerating processes, improving accuracy and efficiency, and improving patient care. By using Artificial intelligence algorithms and machine learning, hospital pharmacists can analyze a large volume of patient data, including medical records, laboratory results, and medication profiles, aiding them in identifying potential drug-drug interactions, assessing the safety and efficacy of medicines, and making informed recommendations. Artificial intelligence integration will improve the quality of pharmaceutical care, optimize processes, promote research, deploy open innovation, and facilitate education. Hospital pharmacists who master Artificial intelligence will play a crucial role in this transformation.


Subjects
Artificial Intelligence; Pharmacy Service, Hospital; Pharmacy Service, Hospital/organization & administration; Humans; Pharmacists; Algorithms; Machine Learning
9.
Insights Imaging ; 15(1): 186, 2024 Aug 01.
Article in English | MEDLINE | ID: mdl-39090273

ABSTRACT

OBJECTIVE: To evaluate whether and how radiological journals present their policies on the use of large language models (LLMs), and to identify the journal characteristic variables associated with the presence of such policies. METHODS: In this meta-research study, we screened journals from the Radiology, Nuclear Medicine and Medical Imaging category of the 2022 Journal Citation Reports, excluding journals published in languages other than English and those whose relevant documents were unavailable. We assessed their LLM use policies: (1) whether a policy is present; (2) whether a policy for the authors, the reviewers, and the editors is present; and (3) whether the policy asks the author to report the usage of LLMs, the name of LLMs, the section that used LLMs, the role of LLMs, the verification of LLMs, and the potential influence of LLMs. The association between the presence of policies and journal characteristic variables was evaluated. RESULTS: LLM use policies were presented in 43.9% (83/189) of journals, and policies for the authors, the reviewers, and the editors were presented in 43.4% (82/189), 29.6% (56/189), and 25.9% (49/189) of journals, respectively. Many journals mentioned the usage (43.4%, 82/189), the name (34.9%, 66/189), the verification (33.3%, 63/189), and the role (31.7%, 60/189) of LLMs, whereas the potential influence of LLMs (4.2%, 8/189) and the section that used LLMs (1.6%, 3/189) were seldom addressed. The publisher was associated with the presence of LLM use policies (p < 0.001). CONCLUSION: The presence of LLM use policies is suboptimal in radiological journals. A reporting guideline is encouraged to facilitate reporting quality and transparency. CRITICAL RELEVANCE STATEMENT: A shared, complete reporting guideline developed by stakeholders and endorsed by journals may facilitate the quality and transparency of LLM use in scientific writing. KEY POINTS: Policies on LLM use in radiological journals are unexplored. Some radiological journals presented policies on LLM use. A shared, complete reporting guideline for LLM use is desired.

10.
Stud Health Technol Inform ; 315: 373-378, 2024 Jul 24.
Article in English | MEDLINE | ID: mdl-39049286

ABSTRACT

Hospital-acquired falls are a continuing clinical concern. The emergence of advanced analytical methods, including NLP, has created opportunities to leverage nurse-generated data, such as clinical notes, to better address the problem of falls. In this nurse-driven study, we employed an iterative process for expert manual annotation of RNs' clinical notes to enable the training and testing of an NLP pipeline to extract factors related to falls. The resulting annotated data corpus had moderately high interrater reliability (F-score=0.74) and captured a breadth of clinical concepts for extraction, with potential utility beyond patient falls. Further research is needed to determine which annotation tasks benefit most from expert nurse annotators, so as to optimize efficiency when tapping into the invaluable resource represented by the nursing workforce.


Subjects
Accidental Falls; Electronic Health Records; Natural Language Processing; Accidental Falls/prevention & control; Humans; Risk Factors; Nursing Records; Data Mining/methods; Risk Assessment
11.
JAMIA Open ; 7(3): ooae054, 2024 Oct.
Article in English | MEDLINE | ID: mdl-39049992

ABSTRACT

Objective: Surgical registries play a crucial role in clinical knowledge discovery, hospital quality assurance, and quality improvement. However, maintaining a surgical registry requires significant monetary and human resources, given the wide gamut of information abstracted from medical records, ranging from patient comorbidities to procedural details to postoperative outcomes. Although natural language processing (NLP) methods such as pretrained language models (PLMs) have promised automation of this process, there remain substantial barriers to implementation. In particular, constant shifts in both the underlying data and the required registry content are hurdles to the application of NLP technologies. Materials and Methods: In our work, we evaluate the application of PLMs for automating the population of the procedural elements of the Society of Thoracic Surgeons (STS) Adult Cardiac Surgery (ACS) registry, a model we term Cardiovascular Surgery Bidirectional Encoder Representations from Transformers (CS-BERT). CS-BERT was validated across multiple satellite sites and versions of the STS-ACS registry. Results: CS-BERT performed well (F1 score of 0.8417 ± 0.1838) on common cardiac surgery procedures compared to models based on diagnosis codes (F1 score of 0.6130 ± 0.0010). The model also generalized well to satellite sites and across different versions of the STS-ACS registry. Discussion and Conclusions: This study provides evidence that PLMs can be used to extract the more common cardiac surgery procedure variables in the STS-ACS registry, potentially reducing the need for expensive human annotation and enabling wide-scale dissemination. Further research is needed for rare procedural variables, which suffer from both a lack of data and variable documentation quality.
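
The headline comparison here is an F1 score for the language model against a diagnosis-code baseline. A minimal sketch of that evaluation is shown below; the labels and predictions are invented for demonstration and do not reflect the study's data.

```python
# Illustrative sketch (made-up labels): comparing an NLP classifier against a
# diagnosis-code baseline on a single registry field using the F1 metric.
from sklearn.metrics import f1_score

gold      = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]  # chart-abstracted "procedure performed"
nlp_pred  = [1, 1, 0, 1, 0, 0, 1, 0, 0, 1]  # hypothetical PLM output
code_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]  # hypothetical diagnosis-code rule

print("NLP F1: ", round(f1_score(gold, nlp_pred), 3))
print("Code F1:", round(f1_score(gold, code_pred), 3))
```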

12.
EClinicalMedicine ; 73: 102692, 2024 Jul.
Article in English | MEDLINE | ID: mdl-39050586

ABSTRACT

Background: Artificial intelligence deployed to triage patients post-cataract surgery could help to identify and prioritise individuals who need clinical input and to expand clinical capacity. This study investigated the accuracy and safety of an autonomous telemedicine call (Dora, version R1) in detecting cataract surgery patients who need further management and compared its performance against ophthalmic specialists. Methods: 225 participants were recruited from two UK public teaching hospitals after routine cataract surgery between 17 September 2021 and 31 January 2022. Eligible patients received a call from Dora R1 to conduct a follow-up assessment approximately 3 weeks post cataract surgery, which was supervised in real-time by an ophthalmologist. The primary analysis compared decisions made independently by Dora R1 and the supervising ophthalmologist about the clinical significance of five symptoms and whether the patient required further review. Secondary analyses used mixed methods to examine Dora R1's usability and acceptability and to assess cost impact compared to standard care. This study is registered with ClinicalTrials.gov (NCT05213390) and ISRCTN (16038063). Findings: 202 patients were included in the analysis, with data collection completed on 23 March 2022. Dora R1 demonstrated an overall outcome sensitivity of 94% and specificity of 86% and showed moderate to strong agreement (kappa: 0.758-0.970) with clinicians in all parameters. Safety was validated by assessing subsequent outcomes: 11 of the 117 patients (9%) recommended for discharge by Dora R1 had unexpected management changes, but all were also recommended for discharge by the supervising clinician. Four patients were recommended for discharge by Dora R1 but not the clinician; none required further review on callback. Acceptability, from interviews with 20 participants, was generally good in routine circumstances but patients were concerned about the lack of a 'human element' in cases with complications. Feasibility was demonstrated by the high proportion of calls completed autonomously (195/202, 96.5%). Staff cost benefits for Dora R1 compared to standard care were £35.18 per patient. Interpretation: The composite of mixed methods analysis provides preliminary evidence for the safety, acceptability, feasibility, and cost benefits for clinical adoption of an artificial intelligence conversational agent, Dora R1, to conduct follow-up assessment post-cataract surgery. Further evaluation in real-world implementation should be conducted to provide additional evidence around safety and effectiveness in a larger sample from a more diverse set of Trusts. Funding: This manuscript is independent research funded by the National Institute for Health Research and NHSX (Artificial Intelligence in Health and Care Award, AI_AWARD01852).
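
The primary analysis compares the agent's triage decisions with the supervising ophthalmologist's using sensitivity, specificity, and agreement. The sketch below shows how those quantities are computed from paired decisions; the decision vectors are invented for illustration.

```python
# Illustrative sketch with invented decisions (1 = needs further review, 0 = discharge):
# sensitivity, specificity, and Cohen's kappa of an AI triage call vs. the clinician.
from sklearn.metrics import cohen_kappa_score, confusion_matrix

clinician = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
dora      = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(clinician, dora).ravel()
print(f"sensitivity = {tp / (tp + fn):.2f}")
print(f"specificity = {tn / (tn + fp):.2f}")
print(f"kappa       = {cohen_kappa_score(clinician, dora):.2f}")
```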

13.
JACC Adv ; 3(8): 101064, 2024 Aug.
Article in English | MEDLINE | ID: mdl-39050815

ABSTRACT

Background: Heart failure with preserved ejection fraction (HFpEF) is the predominant form of HF in older adults. It represents a heterogenous clinical syndrome that is less well understood across different ethnicities. Objectives: This study aimed to compare the clinical presentation and assess the diagnostic performance of existing HFpEF diagnostic tools between ethnic groups. Methods: A validated Natural Language Processing (NLP) algorithm was applied to the electronic health records of a large London hospital to identify patients meeting the European Society of Cardiology criteria for a diagnosis of HFpEF. NLP extracted patient demographics (including self-reported ethnicity and socioeconomic status), comorbidities, investigation results (N-terminal pro-B-type natriuretic peptide, H2FPEF scores, and echocardiogram reports), and mortality. Analyses were stratified by ethnicity and adjusted for socioeconomic status. Results: Our cohort consisted of 1,261 (64%) White, 578 (29%) Black, and 134 (7%) Asian patients meeting the European Society of Cardiology HFpEF diagnostic criteria. Compared to White patients, Black patients were younger at diagnosis and more likely to have metabolic comorbidities (obesity, diabetes, and hypertension) but less likely to have atrial fibrillation (30% vs 13%; P < 0.001). Black patients had lower N-terminal pro-B-type natriuretic peptide levels and a lower frequency of H2FPEF scores ≥6, indicative of likely HFpEF (26% vs 44%; P < 0.0001). Conclusions: Leveraging an NLP-based artificial intelligence approach to quantify health inequities in HFpEF diagnosis, we discovered that established markers systematically underdiagnose HFpEF in Black patients, possibly due to differences in the underlying comorbidity patterns. Clinicians should be aware of these limitations and its implications for treatment and trial recruitment.

14.
J Med Internet Res ; 26: e60807, 2024 Jul 25.
Article in English | MEDLINE | ID: mdl-39052324

ABSTRACT

BACKGROUND: Over the past 2 years, researchers have used various medical licensing examinations to test whether ChatGPT (OpenAI) possesses accurate medical knowledge. The performance of each version of ChatGPT on medical licensing examinations in multiple environments has shown remarkable differences. At this stage, there is still a lack of a comprehensive understanding of the variability in ChatGPT's performance on different medical licensing examinations. OBJECTIVE: In this study, we reviewed all studies on ChatGPT performance in medical licensing examinations up to March 2024. This review aims to contribute to the evolving discourse on artificial intelligence (AI) in medical education by providing a comprehensive analysis of the performance of ChatGPT in various environments. The insights gained from this systematic review will guide educators, policymakers, and technical experts to effectively and judiciously use AI in medical education. METHODS: We searched the literature published between January 1, 2022, and March 29, 2024, using query strings in Web of Science, PubMed, and Scopus. Two authors screened the literature according to the inclusion and exclusion criteria, extracted data, and independently assessed the quality of the literature using the Quality Assessment of Diagnostic Accuracy Studies-2 tool. We conducted both qualitative and quantitative analyses. RESULTS: A total of 45 studies on the performance of different versions of ChatGPT in medical licensing examinations were included in this study. GPT-4 achieved an overall accuracy rate of 81% (95% CI 78-84; P<.01), significantly surpassing the 58% (95% CI 53-63; P<.01) accuracy rate of GPT-3.5. GPT-4 passed the medical examinations in 26 of 29 cases, outperforming the average scores of medical students in 13 of 17 cases. Translating the examination questions into English improved GPT-3.5's performance but did not affect GPT-4. GPT-3.5 showed no difference in performance between examinations from English-speaking and non-English-speaking countries (P=.72), but GPT-4 performed significantly better on examinations from English-speaking countries (P=.02). Any type of prompt could significantly improve GPT-3.5's (P=.03) and GPT-4's (P<.01) performance. GPT-3.5 performed better on short-text questions than on long-text questions. The difficulty of the questions affected the performance of GPT-3.5 and GPT-4. In image-based multiple-choice questions (MCQs), ChatGPT's accuracy rate ranged from 13.1% to 100%. ChatGPT performed significantly worse on open-ended questions than on MCQs. CONCLUSIONS: GPT-4 demonstrates considerable potential for future use in medical education. However, due to its insufficient accuracy, inconsistent performance, and the challenges posed by differing medical policies and knowledge across countries, GPT-4 is not yet suitable for use in medical education. TRIAL REGISTRATION: PROSPERO CRD42024506687; https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=506687.
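
The review reports pooled accuracy rates with 95% confidence intervals; its meta-analytic pooling is not reproduced here, but the basic interval around a single examination's accuracy can be sketched as below. The counts are invented and the Wilson method is an assumed choice.

```python
# Illustrative sketch (invented counts): a Wilson 95% confidence interval for one
# examination's accuracy rate, the kind of per-study estimate that feeds a pooled analysis.
from statsmodels.stats.proportion import proportion_confint

correct, total = 162, 200  # hypothetical number of exam items answered correctly
low, high = proportion_confint(correct, total, alpha=0.05, method="wilson")
print(f"accuracy = {correct / total:.2%}, 95% CI [{low:.2%}, {high:.2%}]")
```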


Subjects
Educational Measurement; Licensure, Medical; Humans; Licensure, Medical/standards; Licensure, Medical/statistics & numerical data; Educational Measurement/methods; Educational Measurement/standards; Educational Measurement/statistics & numerical data; Clinical Competence/statistics & numerical data; Clinical Competence/standards; Artificial Intelligence; Education, Medical/standards
15.
J Med Internet Res ; 26: e59050, 2024 Jul 25.
Article in English | MEDLINE | ID: mdl-39052327

ABSTRACT

BACKGROUND: Data analysis approaches such as qualitative content analysis are notoriously time and labor intensive because of the time to detect, assess, and code a large amount of data. Tools such as ChatGPT may have tremendous potential in automating at least some of the analysis. OBJECTIVE: The aim of this study was to explore the utility of ChatGPT in conducting qualitative content analysis through the analysis of forum posts from people sharing their experiences on reducing their sugar consumption. METHODS: Inductive and deductive content analysis were performed on 537 forum posts to detect mechanisms of behavior change. Thorough prompt engineering provided appropriate instructions for ChatGPT to execute data analysis tasks. Data identification involved extracting change mechanisms from a subset of forum posts. The precision of the extracted data was assessed through comparison with human coding. On the basis of the identified change mechanisms, coding schemes were developed with ChatGPT using data-driven (inductive) and theory-driven (deductive) content analysis approaches. The deductive approach was informed by the Theoretical Domains Framework using both an unconstrained coding scheme and a structured coding matrix. In total, 10 coding schemes were created from a subset of data and then applied to the full data set in 10 new conversations, resulting in 100 conversations each for inductive and unconstrained deductive analysis. A total of 10 further conversations coded the full data set into the structured coding matrix. Intercoder agreement was evaluated across and within coding schemes. ChatGPT output was also evaluated by the researchers to assess whether it reflected prompt instructions. RESULTS: The precision of detecting change mechanisms in the data subset ranged from 66% to 88%. Overall κ scores for intercoder agreement ranged from 0.72 to 0.82 across inductive coding schemes and from 0.58 to 0.73 across unconstrained coding schemes and structured coding matrix. Coding into the best-performing coding scheme resulted in category-specific κ scores ranging from 0.67 to 0.95 for the inductive approach and from 0.13 to 0.87 for the deductive approaches. ChatGPT largely followed prompt instructions in producing a description of each coding scheme, although the wording for the inductively developed coding schemes was lengthier than specified. CONCLUSIONS: ChatGPT appears fairly reliable in assisting with qualitative analysis. ChatGPT performed better in developing an inductive coding scheme that emerged from the data than adapting an existing framework into an unconstrained coding scheme or coding directly into a structured matrix. The potential for ChatGPT to act as a second coder also appears promising, with almost perfect agreement in at least 1 coding scheme. The findings suggest that ChatGPT could prove useful as a tool to assist in each phase of qualitative content analysis, but multiple iterations are required to determine the reliability of each stage of analysis.
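
The study's prompt engineering is not reproduced in the abstract, but the general pattern, instructing a chat model to apply a fixed coding scheme to one post at a time, can be sketched as below. The model name, prompt wording, and categories are assumptions for demonstration, not the study's engineered prompts.

```python
# Illustrative sketch only: asking a chat model to apply a deductive coding scheme to a
# single forum post about reducing sugar consumption.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

coding_scheme = """Code the post with every category that applies, as a
comma-separated list: goal setting, self-monitoring, social support,
environmental restructuring, substitution, none of the above."""

post = "I stopped keeping biscuits at home and track every snack in an app."

response = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,
    messages=[
        {"role": "system", "content": coding_scheme},
        {"role": "user", "content": post},
    ],
)
print(response.choices[0].message.content)
```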


Subjects
Qualitative Research; Humans
16.
Comput Methods Programs Biomed ; 255: 108334, 2024 Jul 20.
Article in English | MEDLINE | ID: mdl-39053353

ABSTRACT

BACKGROUND AND OBJECTIVES: In the last decade, there has been a growing interest in applying artificial intelligence (AI) systems to breast cancer assessment, including breast density evaluation. However, few models have been developed to integrate textual mammographic reports and mammographic images. Our aims are (1) to generate a natural language processing (NLP)-based AI system, (2) to evaluate external image-based software, and (3) to develop a multimodal system, using the late fusion approach, by integrating image and text inferences for the automatic classification of breast density according to the American College of Radiology (ACR) guidelines in mammograms and radiological reports. METHODS: We first compared different NLP models, three based on n-gram term frequency-inverse document frequency (TF-IDF) and two transformer-based architectures, using 1533 unstructured mammogram reports as a training set and 303 reports as a test set. Subsequently, we evaluated external image-based software using 303 mammogram images. Finally, we assessed our multimodal system, taking into account both text and mammogram images. RESULTS: Our best NLP model achieved 88% accuracy, while the external software and the multimodal system achieved 75% and 80% accuracy, respectively, in classifying ACR breast densities. CONCLUSION: Although our multimodal system outperforms the image-based tool, it currently does not improve the results offered by the NLP model for ACR breast density classification. Nevertheless, the promising results observed here open the possibility of more comprehensive studies regarding the utilization of multimodal tools in the assessment of breast density.
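
A minimal sketch of the TF-IDF baseline family mentioned in the methods is shown below. The toy report snippets, the linear classifier, and the n-gram range are assumptions; the study's actual models and training data differ.

```python
# Illustrative sketch (toy reports): an n-gram TF-IDF + linear classifier for ACR
# breast density categories, the style of text-only baseline the abstract describes.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reports = [
    "Breasts are almost entirely fatty.",
    "Scattered areas of fibroglandular density.",
    "Breast tissue is heterogeneously dense, which may obscure small masses.",
    "The breasts are extremely dense, lowering the sensitivity of mammography.",
]
labels = ["A", "B", "C", "D"]  # ACR density categories

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
model.fit(reports, labels)
print(model.predict(["Heterogeneously dense parenchyma may obscure lesions."]))
```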

17.
Am J Epidemiol ; 2024 Jul 26.
Article in English | MEDLINE | ID: mdl-39060160

ABSTRACT

Fall-related injuries (FRIs) are a major cause of hospitalizations among older patients, but identifying them in unstructured clinical notes poses challenges for large-scale research. In this study, we developed and evaluated Natural Language Processing (NLP) models to address this issue. We utilized all available clinical notes from Mass General Brigham for 2,100 older adults, identifying 154,949 paragraphs of interest through automatic scanning for FRI-related keywords. Two clinical experts directly labeled 5,000 paragraphs to generate benchmark-standard labels, while 3,689 validated patterns were annotated, indirectly labeling 93,157 paragraphs as validated-standard labels. Five NLP models, including vanilla BERT, RoBERTa, Clinical-BERT, Distil-BERT, and SVM, were trained using 2,000 benchmark paragraphs and all validated paragraphs. BERT-based models were trained in three stages: Masked Language Modeling, General Boolean Question Answering (QA), and QA for FRI. For validation, 500 benchmark paragraphs were used, and the remaining 2,500 were used for testing. Performance metrics (precision, recall, F1 scores, and Area Under the ROC [AUROC] and Precision-Recall [AUPR] curves) were used for comparison, with RoBERTa showing the best performance: precision was 0.90 [0.88-0.91], recall [0.90-0.93], F1 score 0.90 [0.89-0.92], and AUROC and AUPR 0.96 [0.95-0.97]. These NLP models accurately identify FRIs from unstructured clinical notes, potentially enhancing the efficiency of clinical notes-based research.
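
For readers unfamiliar with the AUROC and AUPR summaries used here, a minimal sketch follows; the labels and scores are invented, not the study's outputs.

```python
# Illustrative sketch (invented scores): AUROC and average precision (AUPR) for a
# binary "paragraph mentions a fall-related injury" classifier.
from sklearn.metrics import roc_auc_score, average_precision_score

y_true  = [1, 0, 0, 1, 1, 0, 0, 1, 0, 0]
y_score = [0.91, 0.12, 0.30, 0.78, 0.66, 0.05, 0.41, 0.88, 0.22, 0.35]

print(f"AUROC: {roc_auc_score(y_true, y_score):.2f}")
print(f"AUPR:  {average_precision_score(y_true, y_score):.2f}")
```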

18.
J Affect Disord ; 363: 340-347, 2024 Jul 17.
Article in English | MEDLINE | ID: mdl-39029695

ABSTRACT

BACKGROUND: In recent years, automated analyses using novel NLP methods have been used to investigate language abnormalities in schizophrenia. In contrast, only a few studies used automated language analyses in bipolar disorder. To our knowledge, no previous research compared automated language characteristics of first-episode psychosis (FEP) and bipolar disorder (FEBD) using NLP methods. METHODS: Our study included 53 FEP, 40 FEBD and 50 healthy control participants who are native Turkish speakers. Speech samples of the participants in the Thematic Apperception Test (TAT) underwent automated generic and part-of-speech analyses, as well as sentence-level semantic similarity analysis based on SBERT. RESULTS: Both FEBD and FEP were associated with the use of shorter sentences and increased sentence-level semantic similarity but less semantic alignment with the TAT pictures. FEP also demonstrated reduced verbosity and syntactic complexity. FEP differed from FEBD in reduced verbosity, decreased first-person singular pronouns, fewer conjunctions, increased semantic similarity as well as shorter sentence and word length. The mean classification accuracy was 82.45 % in FEP vs HC, 71.1 % in FEBD vs HC, and 73 % in FEP vs FEBD. After Bonferroni correction, the severity of negative symptoms in FEP was associated with reduced verbal output and increased 5th percentile of semantic similarity. LIMITATIONS: The main limitation of this study was the cross-sectional nature. CONCLUSION: Our findings demonstrate that both patient groups showed language abnormalities, which were more severe and widespread in FEP compared to FEBD. Our results suggest that NLP methods reveal transdiagnostic linguistic abnormalities in FEP and FEBD.
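
The sentence-level semantic similarity measure in this study is based on SBERT. A minimal sketch of that computation is shown below using English toy sentences (the study analysed Turkish speech) and an assumed public model; it is not the authors' analysis pipeline.

```python
# Illustrative sketch: sentence-level semantic similarity between consecutive
# sentences of a speech sample using an SBERT model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "The man in the picture looks worried about his family.",
    "He seems anxious that something bad will happen to them.",
    "The weather outside the window is grey and rainy.",
]

embeddings = model.encode(sentences, convert_to_tensor=True)
for i in range(len(sentences) - 1):
    sim = util.cos_sim(embeddings[i], embeddings[i + 1]).item()
    print(f"similarity(sentence {i}, sentence {i + 1}) = {sim:.2f}")
```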

19.
JMIR Med Inform ; 12: e58141, 2024 Jul 23.
Article in English | MEDLINE | ID: mdl-39042454

ABSTRACT

BACKGROUND: Medication safety in residential care facilities is a critical concern, particularly when nonmedical staff provide medication assistance. The complex nature of medication-related incidents in these settings, coupled with the psychological impact on health care providers, underscores the need for effective incident analysis and preventive strategies. A thorough understanding of the root causes, typically through incident-report analysis, is essential for mitigating medication-related incidents. OBJECTIVE: We aimed to develop and evaluate a multilabel classifier using natural language processing to identify factors contributing to medication-related incidents using incident report descriptions from residential care facilities, with a focus on incidents involving nonmedical staff. METHODS: We analyzed 2143 incident reports, comprising 7121 sentences, from residential care facilities in Japan between April 1, 2015, and March 31, 2016. The incident factors were annotated using sentences based on an established organizational factor model and previous research findings. The following 9 factors were defined: procedure adherence, medicine, resident, resident family, nonmedical staff, medical staff, team, environment, and organizational management. To assess the label criteria, 2 researchers with relevant medical knowledge annotated a subset of 50 reports; the interannotator agreement was measured using Cohen κ. The entire data set was subsequently annotated by 1 researcher. Multiple labels were assigned to each sentence. A multilabel classifier was developed using deep learning models, including 2 Bidirectional Encoder Representations From Transformers (BERT)-type models (Tohoku-BERT and a University of Tokyo Hospital BERT pretrained with Japanese clinical text: UTH-BERT) and an Efficiently Learning Encoder That Classifies Token Replacements Accurately (ELECTRA), pretrained on Japanese text. Both sentence- and report-level training were performed; the performance was evaluated by the F1-score and exact match accuracy through 5-fold cross-validation. RESULTS: Among all 7121 sentences, 1167, 694, 2455, 23, 1905, 46, 195, 1104, and 195 included "procedure adherence," "medicine," "resident," "resident family," "nonmedical staff," "medical staff," "team," "environment," and "organizational management," respectively. Owing to limited labels, "resident family" and "medical staff" were omitted from the model development process. The interannotator agreement values were higher than 0.6 for each label. A total of 10, 278, and 1855 reports contained no, 1, and multiple labels, respectively. The models trained using the report data outperformed those trained using sentences, with macro F1-scores of 0.744, 0.675, and 0.735 for Tohoku-BERT, UTH-BERT, and ELECTRA, respectively. The report-trained models also demonstrated better exact match accuracy, with 0.411, 0.389, and 0.399 for Tohoku-BERT, UTH-BERT, and ELECTRA, respectively. Notably, the accuracy was consistent even when the analysis was confined to reports containing multiple labels. CONCLUSIONS: The multilabel classifier developed in our study demonstrated potential for identifying various factors associated with medication-related incidents using incident reports from residential care facilities. Thus, this classifier can facilitate prompt analysis of incident factors, thereby contributing to risk management and the development of preventive strategies.
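
The two report-level metrics in this abstract, macro F1 and exact match accuracy, can be sketched for a multilabel setting as below; the label matrix and predictions are invented for demonstration.

```python
# Illustrative sketch (made-up predictions): macro F1 and exact match accuracy for a
# multilabel incident-factor classifier.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

labels = ["procedure adherence", "medicine", "resident", "nonmedical staff",
          "team", "environment", "organizational management"]

# Rows = incident reports, columns = the 7 factor labels (1 = present).
y_true = np.array([[1, 0, 1, 1, 0, 0, 0],
                   [0, 1, 1, 0, 0, 1, 0],
                   [1, 1, 0, 1, 1, 0, 1]])
y_pred = np.array([[1, 0, 1, 1, 0, 0, 0],
                   [0, 1, 1, 0, 0, 0, 0],
                   [1, 1, 0, 1, 1, 0, 1]])

print("macro F1:   ", round(f1_score(y_true, y_pred, average="macro", zero_division=0), 3))
print("exact match:", round(accuracy_score(y_true, y_pred), 3))  # all labels correct per report
```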

20.
JMIR Med Educ ; 10: e52818, 2024 Jul 23.
Article in English | MEDLINE | ID: mdl-39042876

ABSTRACT

BACKGROUND: The rapid evolution of ChatGPT has generated substantial interest and led to extensive discussions in both public and academic domains, particularly in the context of medical education. OBJECTIVE: This study aimed to evaluate ChatGPT's performance in a pulmonology examination through a comparative analysis with that of third-year medical students. METHODS: In this cross-sectional study, we conducted a comparative analysis with 2 distinct groups. The first group comprised 244 third-year medical students who had previously taken our institution's 2020 pulmonology examination, which was conducted in French. The second group involved ChatGPT-3.5 in 2 separate sets of conversations: without contextualization (V1) and with contextualization (V2). In both V1 and V2, ChatGPT received the same set of questions administered to the students. RESULTS: V1 demonstrated exceptional proficiency in radiology, microbiology, and thoracic surgery, surpassing the majority of medical students in these domains. However, it faced challenges in pathology, pharmacology, and clinical pneumology. In contrast, V2 consistently delivered more accurate responses across various question categories, regardless of the specialization. ChatGPT exhibited suboptimal performance in multiple choice questions compared to medical students. V2 excelled in responding to structured open-ended questions. Both ChatGPT conversations, particularly V2, outperformed students in addressing questions of low and intermediate difficulty. Interestingly, students showcased enhanced proficiency when confronted with highly challenging questions. V1 fell short of passing the examination. Conversely, V2 successfully achieved examination success, outperforming 139 (62.1%) medical students. CONCLUSIONS: While ChatGPT has access to a comprehensive web-based data set, its performance closely mirrors that of an average medical student. Outcomes are influenced by question format, item complexity, and contextual nuances. The model faces challenges in medical contexts requiring information synthesis, advanced analytical aptitude, and clinical judgment, as well as in non-English language assessments and when confronted with data outside mainstream internet sources.


Subjects
Educational Measurement; Pulmonary Medicine; Students, Medical; Humans; Cross-Sectional Studies; Pulmonary Medicine/education; Students, Medical/statistics & numerical data; Educational Measurement/methods; Education, Medical, Undergraduate/methods; Male; Aptitude; Female; Clinical Competence