Results 1-20 of 490
1.
Cureus ; 16(8): e68298, 2024 Aug.
Article in English | MEDLINE | ID: mdl-39350878

ABSTRACT

GPT-4 Vision (GPT-4V) represents a significant advancement in multimodal artificial intelligence, enabling text generation from images without specialized training. This marks the transformation of ChatGPT from a large language model (LLM) into GPT-4's promised large multimodal model (LMM). As these AI models continue to advance, they may enhance radiology workflow and aid with decision support. This technical note explores potential GPT-4V applications in radiology and evaluates performance for sample tasks. GPT-4V capabilities were tested using images from the web, personal and institutional teaching files, and hand-drawn sketches. Prompts evaluated scientific figure analysis, radiologic image reporting, image comparison, handwriting interpretation, sketch-to-code, and artistic expression. In this limited demonstration of GPT-4V's capabilities, it showed promise in classifying images, counting entities, comparing images, and deciphering handwriting and sketches. However, it exhibited limitations in detecting some fractures, discerning changes in lesion size, accurately interpreting complex diagrams, and consistently characterizing radiologic findings. Artistic expression responses were coherent. While GPT-4V may eventually assist with tasks related to radiology, current reliability gaps highlight the need for continued training and improvement before consideration for any medical use by the general public and ultimately clinical integration. Future iterations could enable a virtual assistant to discuss findings, improve reports, extract data from images, and provide decision support based on guidelines, white papers, and appropriateness criteria. Human expertise remains essential for safe practice, and partnerships between physicians, researchers, and technology leaders are necessary to safeguard against risks such as bias and privacy concerns.
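
The prompting workflow this note describes, submitting a radiologic image together with a task-specific text prompt, can be illustrated with a minimal, hedged Python sketch against the OpenAI chat-completions API. The model name, prompt wording, and image URL below are illustrative assumptions, not the study's materials.

```python
# Minimal sketch: send an image plus a task prompt to a multimodal GPT model.
# Assumptions: OPENAI_API_KEY is set; model name and prompt are illustrative.
from openai import OpenAI

client = OpenAI()

def describe_radiograph(image_url: str, task: str) -> str:
    """Ask a vision-capable model to perform one radiology-style task on an image."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": task},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content

# Example (hypothetical image URL):
# print(describe_radiograph("https://example.org/chest_xray.png",
#                           "Classify the modality and describe any visible abnormality."))
```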

2.
Cell Mol Bioeng ; 17(4): 263-277, 2024 Aug.
Article in English | MEDLINE | ID: mdl-39372551

ABSTRACT

Objectives: This review explores the potential applications of large language models (LLMs) such as ChatGPT, GPT-3.5, and GPT-4 in the medical field, aiming to encourage their prudent use, provide professional support, and develop accessible medical AI tools that adhere to healthcare standards. Methods: This paper examines the impact of technologies such as OpenAI's Generative Pre-trained Transformers (GPT) series, including GPT-3.5 and GPT-4, and other large language models (LLMs) in medical education, scientific research, clinical practice, and nursing. Specifically, it includes supporting curriculum design, acting as personalized learning assistants, creating standardized simulated patient scenarios in education; assisting with writing papers, data analysis, and optimizing experimental designs in scientific research; aiding in medical imaging analysis, decision-making, patient education, and communication in clinical practice; and reducing repetitive tasks, promoting personalized care and self-care, providing psychological support, and enhancing management efficiency in nursing. Results: LLMs, including ChatGPT, have demonstrated significant potential and effectiveness in the aforementioned areas, yet their deployment in healthcare settings is fraught with ethical complexities, potential lack of empathy, and risks of biased responses. Conclusion: Despite these challenges, significant medical advancements can be expected through the proper use of LLMs and appropriate policy guidance. Future research should focus on overcoming these barriers to ensure the effective and ethical application of LLMs in the medical field.

3.
J Stomatol Oral Maxillofac Surg ; : 102114, 2024 Oct 09.
Article in English | MEDLINE | ID: mdl-39389541

ABSTRACT

OBJECTIVE: The purpose of this study is to evaluate the performance of Scholar GPT in answering technical questions in the field of oral and maxillofacial surgery and to conduct a comparative analysis with the results of a previous study that assessed the performance of ChatGPT. MATERIALS AND METHODS: Scholar GPT was accessed via ChatGPT (www.chatgpt.com) on March 20, 2024. A total of 60 technical questions (15 each on impacted teeth, dental implants, temporomandibular joint disorders, and orthognathic surgery) from our previous study were used. Scholar GPT's responses were evaluated using a modified Global Quality Scale (GQS). The questions were randomized before scoring using an online randomizer (www.randomizer.org). A single researcher performed the evaluations at three different times, three weeks apart, with each evaluation preceded by a new randomization. In cases of score discrepancies, a fourth evaluation was conducted to determine the final score. RESULTS: Scholar GPT performed well across all technical questions, with an average GQS score of 4.48 (SD=0.93). Comparatively, ChatGPT's average GQS score in the previous study was 3.1 (SD=1.492). The Wilcoxon signed-rank test indicated a significantly higher average score for Scholar GPT compared to ChatGPT (mean difference = 2.00, SE = 0.163, p < 0.001). The Kruskal-Wallis test showed no statistically significant differences among the topic groups (χ² = 0.799, df = 3, p = 0.850, ε² = 0.0135). CONCLUSION: Scholar GPT demonstrated generally high performance on technical questions within oral and maxillofacial surgery and produced more consistent and higher-quality responses than ChatGPT. The findings suggest that GPT models based on academic databases can provide more accurate and reliable information. Additionally, developing a specialized GPT model for oral and maxillofacial surgery could ensure higher quality and consistency in artificial intelligence-generated information.
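
As a hedged illustration of the statistics named above, the paired Wilcoxon signed-rank comparison of GQS scores and the Kruskal-Wallis test across topic groups could be run as in the sketch below; the score arrays are placeholders, not the study's data.

```python
# Sketch of the two tests named in the abstract, on placeholder GQS scores (1-5).
import numpy as np
from scipy.stats import wilcoxon, kruskal

rng = np.random.default_rng(0)
scholar_gpt = rng.integers(3, 6, size=60)   # placeholder Scholar GPT scores
chatgpt = rng.integers(1, 5, size=60)       # placeholder ChatGPT scores

# Paired comparison of the two models on the same 60 questions.
stat, p = wilcoxon(scholar_gpt, chatgpt)
print(f"Wilcoxon signed-rank: W={stat:.1f}, p={p:.4g}")

# Topic-group comparison (15 questions per topic), as in the Kruskal-Wallis test.
groups = np.array_split(scholar_gpt, 4)
h, p_kw = kruskal(*groups)
print(f"Kruskal-Wallis: H={h:.3f}, p={p_kw:.4g}")
```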

4.
Sci Rep ; 14(1): 23285, 2024 10 07.
Article in English | MEDLINE | ID: mdl-39375385

ABSTRACT

This study evaluates the performance of ChatGPT variants, GPT-3.5 and GPT-4, both with and without prompt engineering, against solely student work and a mixed category containing both student and GPT-4 contributions in university-level physics coding assignments using the Python language. Comparing 50 student submissions to 50 AI-generated submissions across different categories, and marked blindly by three independent markers, we amassed n = 300 data points. Students averaged 91.9% (SE: 0.4), surpassing the highest performing AI submission category, GPT-4 with prompt engineering, which scored 81.1% (SE: 0.8), a statistically significant difference (p = 2.482 × 10⁻¹⁰). Prompt engineering significantly improved scores for both GPT-4 (p = 1.661 × 10⁻⁴) and GPT-3.5 (p = 4.967 × 10⁻⁹). Additionally, the blinded markers were tasked with guessing the authorship of the submissions on a four-point Likert scale from 'Definitely AI' to 'Definitely Human'. They accurately identified the authorship, with 92.1% of the work categorized as 'Definitely Human' being human-authored. Simplifying this to a binary 'AI' or 'Human' categorization resulted in an average accuracy rate of 85.3%. These findings suggest that while AI-generated work closely approaches the quality of university students' work, it often remains detectable by human evaluators.


Subjects
Students, Humans, Universities
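
A comparison of mark distributions like the one reported in this entry could be sketched as follows. The marks are placeholders generated around the reported means, and the choice of a Mann-Whitney U test is an assumption, since the abstract does not name the test used.

```python
# Sketch: compare student marks against AI-generated submission marks.
# Placeholder data; the Mann-Whitney U test is an assumption.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
student_marks = rng.normal(91.9, 3.0, size=50).clip(0, 100)
gpt4_prompt_marks = rng.normal(81.1, 5.5, size=50).clip(0, 100)

u, p = mannwhitneyu(student_marks, gpt4_prompt_marks, alternative="two-sided")
print(f"Mann-Whitney U={u:.0f}, p={p:.3g}")
```
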
5.
JMIR Med Inform ; 12: e64143, 2024 Sep 30.
Article in English | MEDLINE | ID: mdl-39365849

ABSTRACT

Unlabelled: Cardiovascular drug development requires synthesizing relevant literature about indications, mechanisms, biomarkers, and outcomes. This short study investigates the performance, cost, and prompt engineering trade-offs of 3 large language models in accelerating the literature screening process for cardiovascular drug development applications.


Subjects
Drug Development, Cross-Sectional Studies, Humans, Drug Development/methods, Cardiovascular Agents/therapeutic use, Abstracting and Indexing, Cardiovascular Diseases/drug therapy, Natural Language Processing
6.
Article in English | MEDLINE | ID: mdl-39238375

ABSTRACT

Artificial intelligence, especially machine learning, makes predictions using algorithms and past data. Notably, interest has increased in using artificial intelligence, particularly generative AI, in the pharmacovigilance of pharmaceuticals under development as well as those already on the market. This review was conducted to understand how generative AI can play an important role in pharmacovigilance and in improving drug safety monitoring. Data from previously published articles and news items were reviewed. We used PubMed and Google Scholar as our search engines, with the keywords pharmacovigilance, artificial intelligence, machine learning, drug safety, and patient safety. In total, we reviewed 109 articles published up to 31 January 2024; the information obtained was interpreted, compiled, and evaluated, and conclusions were drawn. Generative AI has transformative potential in pharmacovigilance, showcasing benefits such as enhanced adverse event detection, data-driven risk prediction, and optimized drug development. By making it easier to process and analyze big datasets, generative artificial intelligence has applications across a variety of disease states. Machine learning and automation in this field can streamline pharmacovigilance procedures and provide a more efficient way to assess safety-related data. Nevertheless, more investigation is required to determine how this optimization affects the quality of safety analyses. In the near future, increased utilization of artificial intelligence is anticipated, especially in predicting side effects and adverse drug reactions (ADRs).

7.
Semin Vasc Surg ; 37(3): 342-349, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39277351

ABSTRACT

Virtual assistants, broadly defined as digital services designed to simulate human conversation and provide personalized responses based on user input, have the potential to improve health care by supporting clinicians and patients in terms of diagnosing and managing disease, performing administrative tasks, and supporting medical research and education. These tasks are particularly helpful in vascular surgery, where the clinical and administrative burden is high due to the rising incidence of vascular disease, the medical complexity of the patients, and the potential for innovation and care advancement. The rapid development of artificial intelligence, machine learning, and natural language processing techniques has facilitated the training of large language models, such as GPT-4 (OpenAI), which can support the development of increasingly powerful virtual assistants. These tools may support holistic, multidisciplinary, and high-quality vascular care delivery throughout the pre-, intra-, and postoperative stages. It is critical to consider the design, safety, and challenges related to virtual assistants, including data security, ethical, and equity concerns. By combining the perspectives of patients, clinicians, data scientists, and other stakeholders when developing, implementing, and monitoring virtual assistants, there is potential to harness the power of this technology to care for vascular surgery patients more effectively. In this comprehensive review article, we introduce the concept of virtual assistants, describe potential applications of virtual assistants in vascular surgery for clinicians and patients, highlight the benefits and drawbacks of large language models, such as GPT-4, and discuss considerations around the design, safety, and challenges associated with virtual assistants in vascular surgery.


Subjects
Vascular Surgical Procedures, Humans, Vascular Surgical Procedures/adverse effects, Surgeons/education, Delivery of Health Care, Integrated/organization & administration, Vascular Diseases/surgery, Vascular Diseases/diagnosis, Vascular Diseases/diagnostic imaging
8.
Healthcare (Basel) ; 12(17)2024 Aug 30.
Article in English | MEDLINE | ID: mdl-39273750

ABSTRACT

Given the widespread application of ChatGPT, we aim to evaluate its proficiency in the emergency medicine specialty written examination. Additionally, we compare the performance of GPT-3.5, GPT-4, custom GPTs, and GPT-4o. The research seeks to ascertain whether custom GPTs possess the essential capabilities and access to knowledge bases necessary for providing accurate information, and to explore the effectiveness and potential of personalized knowledge bases in supporting the education of medical residents. We evaluated the performance of ChatGPT-3.5, GPT-4, custom GPTs, and GPT-4o on the Emergency Medicine Specialist Examination in Taiwan. Two hundred single-choice exam questions were provided to these AI models, and their responses were recorded. Correct rates were compared among the four models, and the McNemar test was applied to paired model data to determine whether there were significant changes in performance. Out of 200 questions, GPT-3.5, GPT-4, custom GPTs, and GPT-4o correctly answered 77, 105, 119, and 138 questions, respectively. GPT-4o demonstrated the highest performance, significantly better than GPT-4, which, in turn, outperformed GPT-3.5, while custom GPTs exhibited superior performance compared to GPT-4 but inferior performance compared to GPT-4o, with all p < 0.05. In the emergency medicine specialty written exam, our findings highlight the value and potential of large language models (LLMs), as well as their strengths and limitations, especially regarding question types and image-inclusion capabilities. Not only do GPT-4o and custom GPTs facilitate exam preparation, but they also elevate the evidence level in responses and source accuracy, demonstrating significant potential to transform educational frameworks and clinical practices in medicine.
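
The paired McNemar comparison described above can be sketched as follows; the contingency counts are placeholders chosen only to illustrate how correct/incorrect answers from two models on the same 200 questions are paired.

```python
# Sketch of a McNemar test on paired question-level outcomes for two models.
# Counts are placeholders, not the study's data.
from statsmodels.stats.contingency_tables import mcnemar

# Rows: model A correct / incorrect; columns: model B correct / incorrect.
table = [[95, 10],   # both correct / only A correct
         [43, 52]]   # only B correct / both incorrect
result = mcnemar(table, exact=True)  # exact binomial test on the discordant pairs
print(f"statistic={result.statistic}, p-value={result.pvalue:.4g}")
```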

10.
JMIR Med Inform ; 12: e59258, 2024 Sep 04.
Article in English | MEDLINE | ID: mdl-39230947

ABSTRACT

BACKGROUND: Reading medical papers is a challenging and time-consuming task for doctors, especially when the papers are long and complex. A tool that can help doctors efficiently process and understand medical papers is needed. OBJECTIVE: This study aims to critically assess and compare the comprehension capabilities of large language models (LLMs) in accurately and efficiently understanding medical research papers using the STROBE (Strengthening the Reporting of Observational Studies in Epidemiology) checklist, which provides a standardized framework for evaluating key elements of observational studies. METHODS: This is a methodological study that evaluates the understanding capabilities of new generative artificial intelligence tools on medical papers. A novel benchmark pipeline processed 50 medical research papers from PubMed, comparing the answers of 6 LLMs (GPT-3.5-Turbo, GPT-4-0613, GPT-4-1106, PaLM 2, Claude v1, and Gemini Pro) to the benchmark established by expert medical professors. Fifteen questions, derived from the STROBE checklist, assessed the LLMs' understanding of different sections of a research paper. RESULTS: LLMs exhibited varying performance, with GPT-3.5-Turbo achieving the highest percentage of correct answers (n=3916, 66.9%), followed by GPT-4-1106 (n=3837, 65.6%), PaLM 2 (n=3632, 62.1%), Claude v1 (n=2887, 58.3%), Gemini Pro (n=2878, 49.2%), and GPT-4-0613 (n=2580, 44.1%). Statistical analysis revealed statistically significant differences between LLMs (P<.001), with older models showing inconsistent performance compared to newer versions. LLMs showcased distinct performances for each question across different parts of a scholarly paper, with certain models, such as PaLM 2 and GPT-3.5, showing remarkable versatility and depth in understanding. CONCLUSIONS: This study is the first to evaluate the performance of different LLMs in understanding medical papers using the retrieval-augmented generation method. The findings highlight the potential of LLMs to enhance medical research by improving efficiency and facilitating evidence-based decision-making. Further research is needed to address limitations such as the influence of question formats, potential biases, and the rapid evolution of LLM models.
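
A hedged sketch of the retrieval-augmented idea behind such a benchmark: retrieve the passages of a paper most relevant to a STROBE-derived question, then pass them to an LLM. The model names, chunking strategy, and question wording are illustrative assumptions, not the study's pipeline.

```python
# Minimal retrieval-augmented sketch: embed paper chunks, retrieve the most
# relevant ones for a checklist question, and ask a chat model.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def answer_strobe_question(paper_text: str, question: str, k: int = 3) -> str:
    chunks = [paper_text[i:i + 1500] for i in range(0, len(paper_text), 1500)]
    chunk_vecs, q_vec = embed(chunks), embed([question])[0]
    scores = chunk_vecs @ q_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q_vec))
    context = "\n\n".join(chunks[i] for i in np.argsort(scores)[-k:])
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed; the study compared six different models
        messages=[{"role": "user",
                   "content": f"Using only this excerpt of a paper:\n{context}\n\n"
                              f"Answer the checklist question: {question}"}],
    )
    return resp.choices[0].message.content
```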

11.
Stud Health Technol Inform ; 318: 18-23, 2024 Sep 24.
Article in English | MEDLINE | ID: mdl-39320175

ABSTRACT

While the Fast Healthcare Interoperability Resources (FHIR) clinical terminology server enables quick and easy search and retrieval of coded medical data, it still has some drawbacks. When searching, any typographical errors, variations in word forms, or deviations in word sequence might lead to incorrect search outcomes. For retrieval, queries to the server must strictly follow the FHIR application programming interface format, which requires users to know the syntax and remember the attribute codes they wish to retrieve. To improve its functionality, a natural language interface was built that harnesses the capabilities of two preeminent large language models, along with other cutting-edge technologies such as speech-to-text conversion, vector semantic searching, and conversational artificial intelligence. Preliminary evaluation shows promising results in building a natural language interface for the FHIR clinical terminology system.


Subjects
Natural Language Processing, User-Computer Interface, Terminology as Topic, Health Information Interoperability, Controlled Vocabulary, Information Storage and Retrieval/methods, Humans, Electronic Health Records/classification, Semantics, Artificial Intelligence
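
For context on the strict query format the abstract mentions, a minimal sketch of the kind of FHIR terminology call a natural language front end would construct is shown below. The server base URL and the SNOMED CT implicit value set are illustrative assumptions, not the system described in the paper.

```python
# Sketch of a strictly formatted FHIR terminology query (ValueSet/$expand with a
# free-text filter). Base URL and value set URL are hypothetical.
import requests

FHIR_BASE = "https://tx.example.org/fhir"  # hypothetical terminology server

def search_concepts(free_text: str, count: int = 5):
    """Filter-expand an implicit value set by a free-text term."""
    resp = requests.get(
        f"{FHIR_BASE}/ValueSet/$expand",
        params={
            "url": "http://snomed.info/sct?fhir_vs",  # assumed implicit value set
            "filter": free_text,
            "count": count,
        },
        headers={"Accept": "application/fhir+json"},
        timeout=30,
    )
    resp.raise_for_status()
    return [(c["code"], c["display"])
            for c in resp.json().get("expansion", {}).get("contains", [])]

# An LLM-based front end would map a spoken or typed request (possibly with typos
# or reordered words) to a call such as: search_concepts("myocardial infarction")
```
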
12.
Trop Med Infect Dis ; 9(9)2024 Sep 16.
Article in English | MEDLINE | ID: mdl-39330905

ABSTRACT

Malaria and typhoid fever are prevalent diseases in tropical regions, and both are exacerbated by unclear protocols, drug resistance, and environmental factors. Prompt and accurate diagnosis is crucial to improve accessibility and reduce mortality rates. Traditional diagnosis methods cannot effectively capture the complexities of these diseases due to the presence of similar symptoms. Although machine learning (ML) models offer accurate predictions, they operate as "black boxes" with non-interpretable decision-making processes, making it challenging for healthcare providers to comprehend how the conclusions are reached. This study employs explainable AI (XAI) models such as Local Interpretable Model-agnostic Explanations (LIME) and large language models (LLMs) such as GPT to clarify diagnostic results for healthcare workers, building trust and transparency in medical diagnostics by describing which symptoms had the greatest impact on the model's decisions and providing clear, understandable explanations. The models were implemented on Google Colab and Visual Studio Code because of their rich libraries and extensions. Results showed that the Random Forest model outperformed the other tested models; in addition, important features were identified with the LIME plots, while ChatGPT 3.5 had a comparative advantage over other LLMs. The study integrates RF, LIME, and GPT in building a mobile app to enhance interpretability and transparency in a malaria and typhoid diagnosis system. Despite its promising results, the system's performance is constrained by the quality of the dataset. Additionally, while LIME and GPT improve transparency, they may introduce complexities in real-time deployment due to computational demands and the need for internet service to maintain relevance and accuracy. The findings suggest that AI-driven diagnostic systems can significantly enhance healthcare delivery in environments with limited resources, and future work can explore the applicability of this framework to other medical conditions and datasets.
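
The Random Forest plus LIME pattern described above can be sketched as follows; the feature names, labels, and data are synthetic placeholders, not the study's dataset.

```python
# Sketch of the RF + LIME explanation pattern on synthetic binary symptom data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

features = ["fever", "headache", "abdominal_pain", "chills", "diarrhea"]
rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(300, len(features)))   # placeholder symptom matrix
y = rng.choice(["malaria", "typhoid"], size=300)    # placeholder labels

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    X, feature_names=features, class_names=sorted(set(y)),
    discretize_continuous=False,
)
explanation = explainer.explain_instance(X[0], model.predict_proba, num_features=5)
for feature, weight in explanation.as_list():
    print(f"{feature}: {weight:+.3f}")   # symptoms with the greatest local impact
```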

13.
BMC Med Educ ; 24(1): 1013, 2024 Sep 16.
Article in English | MEDLINE | ID: mdl-39285377

ABSTRACT

BACKGROUND: ChatGPT, a recently developed artificial intelligence (AI) chatbot, has demonstrated improved performance in examinations in the medical field. However, thus far, an overall evaluation of the potential of ChatGPT models (ChatGPT-3.5 and GPT-4) in a variety of national health licensing examinations is lacking. This study aimed to provide a comprehensive assessment of the ChatGPT models' performance in national licensing examinations for medical, pharmacy, dentistry, and nursing research through a meta-analysis. METHODS: Following the PRISMA protocol, full-text articles from MEDLINE/PubMed, EMBASE, ERIC, Cochrane Library, Web of Science, and key journals were reviewed from the time of ChatGPT's introduction to February 27, 2024. Studies were eligible if they evaluated the performance of a ChatGPT model (ChatGPT-3.5 or GPT-4); related to national licensing examinations in the fields of medicine, pharmacy, dentistry, or nursing; involved multiple-choice questions; and provided data that enabled the calculation of effect size. Two reviewers independently completed data extraction, coding, and quality assessment. The JBI Critical Appraisal Tools were used to assess the quality of the selected articles. Overall effect size and 95% confidence intervals (CIs) were calculated using a random-effects model. RESULTS: A total of 23 studies were considered for this review, which evaluated the accuracy of four types of national licensing examinations. The selected articles were in the fields of medicine (n = 17), pharmacy (n = 3), nursing (n = 2), and dentistry (n = 1). They reported varying accuracy levels, ranging from 36% to 77% for ChatGPT-3.5 and from 64.4% to 100% for GPT-4. The overall effect size for the percentage of accuracy was 70.1% (95% CI, 65-74.8%), which was statistically significant (p < 0.001). Subgroup analyses revealed that GPT-4 demonstrated significantly higher accuracy in providing correct responses than its earlier version, ChatGPT-3.5. Additionally, in the context of health licensing examinations, the ChatGPT models exhibited greater proficiency in the following order: pharmacy, medicine, dentistry, and nursing. However, the lack of a broader set of questions, including open-ended and scenario-based questions, and significant heterogeneity were limitations of this meta-analysis. CONCLUSIONS: This study sheds light on the accuracy of ChatGPT models in four national health licensing examinations across various countries and provides a practical basis and theoretical support for future research. Further studies are needed to explore their utilization in medical and health education by including a broader and more diverse range of questions, along with more advanced versions of AI chatbots.


Subjects
Artificial Intelligence, Educational Measurement, Licensure, Humans, Education, Nursing/standards, Educational Measurement/methods, Educational Measurement/standards, Licensure/standards, Education, Pharmacy/standards, Education, Medical/standards, Education, Dental/standards
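
A hedged sketch of the random-effects pooling such a meta-analysis describes (DerSimonian-Laird estimation of between-study variance) is shown below. The per-study accuracies and sample sizes are placeholders, not the 23 included studies.

```python
# DerSimonian-Laird random-effects pooling of accuracy proportions (placeholder data).
import numpy as np

acc = np.array([0.55, 0.72, 0.64, 0.81, 0.69])   # placeholder per-study accuracy
n = np.array([120, 200, 150, 300, 180])          # placeholder question counts

var = acc * (1 - acc) / n                        # within-study variance
w = 1 / var                                      # fixed-effect weights
fixed = np.sum(w * acc) / np.sum(w)
q = np.sum(w * (acc - fixed) ** 2)               # Cochran's Q (heterogeneity)
tau2 = max(0.0, (q - (len(acc) - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))

w_star = 1 / (var + tau2)                        # random-effects weights
pooled = np.sum(w_star * acc) / np.sum(w_star)
se = np.sqrt(1 / np.sum(w_star))
print(f"pooled accuracy = {pooled:.3f} "
      f"(95% CI {pooled - 1.96*se:.3f}-{pooled + 1.96*se:.3f})")
```
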
14.
JMIR Med Educ ; 10: e52346, 2024 Sep 27.
Article in English | MEDLINE | ID: mdl-39331527

ABSTRACT

Unlabelled: Instructional and clinical technologies have been transforming dental education. With the emergence of artificial intelligence (AI), the opportunities for using AI in education have increased. With the recent advancement of generative AI, large language models (LLMs) and foundation models have gained attention for their capabilities in natural language understanding and generation, as well as in combining multiple types of data, such as text, images, and audio. A common example has been ChatGPT, which is based on a powerful LLM, the GPT model. This paper discusses the potential benefits and challenges of incorporating LLMs in dental education, focusing on periodontal charting with a use case to outline the capabilities of LLMs. LLMs can provide personalized feedback, generate case scenarios, and create educational content to contribute to the quality of dental education. However, challenges, limitations, and risks exist, including bias and inaccuracy in the content created, privacy and security concerns, and the risk of overreliance. With guidance and oversight, and by effectively and ethically integrating LLMs, dental education can incorporate engaging and personalized learning experiences for students toward readiness for real-life clinical practice.


Subjects
Artificial Intelligence, Education, Dental, Humans, Education, Dental/methods, Models, Educational
15.
JMIR Ment Health ; 11: e53778, 2024 Sep 26.
Article in English | MEDLINE | ID: mdl-39324852

ABSTRACT

Background: Motivational interviewing (MI) is a therapeutic technique that has been successful in helping smokers reduce smoking but has limited accessibility due to the high cost and low availability of clinicians. To address this, the MIBot project has sought to develop a chatbot that emulates an MI session with a client, with the specific goal of moving an ambivalent smoker toward the direction of quitting. One key element of an MI conversation is reflective listening, where a therapist expresses their understanding of what the client has said by uttering a reflection that encourages the client to continue their thought process. Complex reflections link the client's responses to relevant ideas and facts to enhance this contemplation. Backward-looking complex reflections (BLCRs) link the client's most recent response to a relevant selection of the client's previous statements. Our current chatbot can generate complex reflections (but not BLCRs) using large language models (LLMs) such as GPT-2, which allow the generation of unique, human-like messages customized to client responses. Recent advancements in these models, such as the introduction of GPT-4, provide a novel way to generate complex text by feeding the models instructions and conversational history directly, making this a promising approach to generating BLCRs. Objective: This study aims to develop a method to generate BLCRs for an MI-based smoking cessation chatbot and to measure the method's effectiveness. Methods: LLMs such as GPT-4 can be stimulated to produce specific types of responses to their inputs by "asking" them with an English-based description of the desired output. These descriptions are called prompts, and the goal of writing a description that causes an LLM to generate the required output is termed prompt engineering. We developed an instruction to prompt GPT-4 to generate a BLCR, given the portions of the transcript of the conversation up to the point where the reflection was needed. The approach was tested on 50 previously collected MIBot transcripts of conversations with smokers and was used to generate a total of 150 reflections. The quality of the reflections was rated on a 4-point scale by 3 independent raters to determine whether they met specific criteria for acceptability. Results: Of the 150 generated reflections, 132 (88%) met the level of acceptability. The remaining 18 (12%) had one or more flaws that made them inappropriate as BLCRs. The 3 raters had pairwise agreement on 80% to 88% of these scores. Conclusions: The method presented to generate BLCRs is good enough to be used as one source of reflections in an MI-style conversation but would need an automatic checker to eliminate the unacceptable ones. This work illustrates the power of the new LLMs to generate therapeutic client-specific responses under the command of a language-based specification.


Subjects
Algorithms, Motivational Interviewing, Smoking Cessation, Humans, Smoking Cessation/methods, Smoking Cessation/psychology, Motivational Interviewing/methods, Adult, Female, Male, Middle Aged
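
A hedged sketch of the prompting approach described above, generating a backward-looking complex reflection from a conversation transcript, is shown below. The instruction text is an illustrative paraphrase, not the prompt engineered in the study, and the model name is an assumption.

```python
# Sketch: prompt a chat model for a backward-looking complex reflection (BLCR).
from openai import OpenAI

client = OpenAI()

INSTRUCTION = (
    "You are a motivational interviewing counsellor. Read the conversation so far "
    "and reply with a single reflection that links the client's most recent "
    "statement to something the client said earlier in the conversation."
)

def backward_looking_reflection(transcript: list[dict]) -> str:
    """transcript: list of {'role': 'assistant'|'user', 'content': ...} turns."""
    response = client.chat.completions.create(
        model="gpt-4o",  # the study used GPT-4; the exact model variant is assumed
        messages=[{"role": "system", "content": INSTRUCTION}, *transcript],
    )
    return response.choices[0].message.content
```
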
16.
BMC Med Educ ; 24(1): 1060, 2024 Sep 27.
Article in English | MEDLINE | ID: mdl-39334087

ABSTRACT

BACKGROUND: Multiple choice questions are heavily used in medical education assessments, but they rely on recognition instead of knowledge recall. However, grading open questions is a time-intensive task for teachers. Automatic short answer grading (ASAG) has tried to fill this gap, and with the recent advent of large language models (LLMs), this branch has seen new momentum. METHODS: We graded 2288 student answers from 12 undergraduate medical education courses in 3 languages using GPT-4 and Gemini 1.0 Pro. RESULTS: GPT-4 proposed significantly lower grades than the human evaluator but reached low rates of false positives. The grades of Gemini 1.0 Pro were not significantly different from the teachers'. Both LLMs reached moderate agreement with human grades, and GPT-4 achieved high precision among answers considered fully correct. A consistent grading behavior could be determined for high-quality answer keys. A weak correlation was found with respect to the length or language of student answers. There is a risk of bias if the LLM knows the human grade a priori. CONCLUSIONS: LLM-based ASAG applied to medical education still requires human oversight, but time can be spared on the edge cases, allowing teachers to focus on the middle ones. For Bachelor-level medical education questions, the training knowledge of LLMs appears to be sufficient; fine-tuning is thus not necessary.


Subjects
Education, Medical, Undergraduate, Educational Measurement, Education, Medical, Undergraduate/methods, Humans, Educational Measurement/methods, Language, Students, Medical
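
A minimal, hedged sketch of LLM-based short-answer grading of the kind described above: the rubric, scale, and model name are illustrative assumptions, not the study's grading protocol. Note that the human grade is deliberately not shown to the model, in line with the bias risk the abstract mentions.

```python
# Sketch: ask a chat model to grade a short answer against an answer key.
from openai import OpenAI

client = OpenAI()

def grade_answer(question: str, answer_key: str, student_answer: str) -> str:
    prompt = (
        f"Question: {question}\n"
        f"Reference answer (key): {answer_key}\n"
        f"Student answer: {student_answer}\n\n"
        "Grade the student answer from 0 to 5 against the key. "
        "Reply with the grade on the first line and a one-sentence justification."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed; the study used GPT-4 and Gemini 1.0 Pro
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # keep grading as deterministic as the API allows
    )
    return response.choices[0].message.content
```
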
17.
J Dent Sci ; 19(4): 2262-2267, 2024 Oct.
Article in English | MEDLINE | ID: mdl-39347065

ABSTRACT

Background/purpose: Large language models (LLMs) such as OpenAI's ChatGPT, Google's Bard, and Microsoft's Bing Chat have shown potential as educational tools in the medical and dental fields. This study evaluated their effectiveness using questions from the Japanese national dental hygienist examination, focusing on textual information only. Materials and methods: We analyzed 73 questions from the 32nd Japanese national dental hygienist examination, conducted in March 2023, using LLMs ChatGPT-3.5, GPT-4, Bard, and Bing Chat. Each question was categorized into one of nine domains. Standardized prompts were used for all LLMs, and Fisher's exact test was applied for statistical analysis. Results: GPT-4 achieved the highest accuracy (75.3%), followed by Bing (68.5%), Bard (66.7%), and GPT-3.5 (63.0%). There were no statistically significant differences between the LLMs. The performance varied across different question categories, with all models excelling in the 'Disease mechanism and promotion of recovery process' category (100% accuracy). GPT-4 generally outperformed other models, especially in multi-answer questions. Conclusion: GPT-4 demonstrated the highest overall accuracy among the LLMs tested, indicating its superior potential as an educational support tool in dental hygiene studies. The study highlights the varied performance of different LLMs across various question categories. While GPT-4 is currently the most effective, the capabilities of LLMs in educational settings are subject to continual change and improvement.
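
The pairwise comparison described above, Fisher's exact test on correct/incorrect counts for two models, can be sketched as follows; the counts are back-calculated approximations from the reported accuracies (73 questions; GPT-4 75.3% ≈ 55 correct, GPT-3.5 63.0% ≈ 46 correct).

```python
# Sketch of a Fisher's exact comparison of two models' correct/incorrect counts.
from scipy.stats import fisher_exact

table = [[55, 73 - 55],   # GPT-4: correct, incorrect (approximate counts)
         [46, 73 - 46]]   # GPT-3.5: correct, incorrect (approximate counts)
odds_ratio, p = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, p = {p:.3f}")  # not significant, as reported
```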

18.
Res Sq ; 2024 Sep 11.
Article in English | MEDLINE | ID: mdl-39315262

ABSTRACT

Background: The HPV vaccine is an effective measure to prevent and control the diseases caused by human papillomavirus (HPV). This study addresses the development of VaxBot-HPV, a chatbot aimed at improving health literacy and promoting vaccination uptake by providing information and answering questions about the HPV vaccine. Methods: We constructed the knowledge base (KB) for VaxBot-HPV, which consists of 451 documents from biomedical literature and web sources on the HPV vaccine. We extracted 202 question-answer pairs from the KB and 39 questions generated by GPT-4 for training and testing purposes. To comprehensively understand the capabilities and potential of GPT-based chatbots, three models were involved in this study: GPT-3.5, VaxBot-HPV, and GPT-4. The evaluation criteria included answer relevancy and faithfulness. Results: VaxBot-HPV demonstrated superior performance in answer relevancy and faithfulness compared to baselines, with an answer relevancy of 0.85 and faithfulness of 0.97 for the test questions in the KB, and an answer relevancy of 0.85 and faithfulness of 0.96 for the GPT-generated questions. Conclusions: This study underscores the importance of leveraging advanced language models and fine-tuning techniques in the development of chatbots for healthcare applications, with implications for improving medical education and public health communication.

19.
medRxiv ; 2024 Sep 06.
Article in English | MEDLINE | ID: mdl-39281744

ABSTRACT

Background and Aims: Patient-reported outcomes (PROs) are vital in assessing disease activity and treatment outcomes in inflammatory bowel disease (IBD). However, manual extraction of these PROs from the free-text of clinical notes is burdensome. We aimed to improve data curation from free-text information in the electronic health record, making it more available for research and quality improvement. This study aimed to compare traditional natural language processing (tNLP) and large language models (LLMs) in extracting three IBD PROs (abdominal pain, diarrhea, fecal blood) from clinical notes across two institutions. Methods: Clinic notes were annotated for each PRO using preset protocols. Models were developed and internally tested at the University of California San Francisco (UCSF), and then externally validated at Stanford University. We compared tNLP and LLM-based models on accuracy, sensitivity, specificity, positive and negative predictive value. Additionally, we conducted fairness and error assessments. Results: Inter-rater reliability between annotators was >90%. On the UCSF test set (n=50), the top-performing tNLP models showcased accuracies of 92% (abdominal pain), 82% (diarrhea) and 80% (fecal blood), comparable to GPT-4, which was 96%, 88%, and 90% accurate, respectively. On external validation at Stanford (n=250), tNLP models failed to generalize (61-62% accuracy) while GPT-4 maintained accuracies >90%. PaLM-2 and GPT-4 showed similar performance. No biases were detected based on demographics or diagnosis. Conclusions: LLMs are accurate and generalizable methods for extracting PROs. They maintain excellent accuracy across institutions, despite heterogeneity in note templates and authors. Widespread adoption of such tools has the potential to enhance IBD research and patient care.
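
A hedged sketch of LLM-based extraction of the three IBD patient-reported outcomes from a clinic note is shown below. The prompt wording, model name, and output schema are illustrative assumptions; the study's annotation protocols are not reproduced here.

```python
# Sketch: extract abdominal pain, diarrhea, and fecal blood status from a note
# as machine-readable JSON.
import json
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "From the clinic note below, report whether each symptom is present, absent, "
    "or not mentioned. Answer with JSON keys: abdominal_pain, diarrhea, fecal_blood.\n\n"
    "Note:\n{note}"
)

def extract_pros(note_text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed; the study evaluated GPT-4 and PaLM-2
        messages=[{"role": "user", "content": PROMPT.format(note=note_text)}],
        response_format={"type": "json_object"},  # request machine-readable output
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)
```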

20.
Nucl Med Mol Imaging ; 58(6): 323-331, 2024 Oct.
Article in English | MEDLINE | ID: mdl-39308492

ABSTRACT

The rapid advancements in natural language processing, particularly with the development of Generative Pre-trained Transformer (GPT) models, have opened up new avenues for researchers across various domains. This review article explores the potential of GPT as a research tool, focusing on the core functionalities, key features, and real-world applications of the GPT-4 model. We delve into the concept of prompt engineering, a crucial technique for effectively utilizing GPT, and provide guidelines for designing optimal prompts. Through case studies, we demonstrate how GPT can be applied at various stages of the research process, including literature review, data analysis, and manuscript preparation. The utilization of GPT is expected to enhance research efficiency, stimulate creative thinking, facilitate interdisciplinary collaboration, and increase the impact of research findings. However, it is essential to view GPT as a complementary tool rather than a substitute for human expertise, keeping in mind its limitations and ethical considerations. As GPT continues to evolve, researchers must develop a deep understanding of this technology and leverage its potential to advance their research endeavors while being mindful of its implications.
