Open Access
Bioengineering, volume 12, issue 1, page 1

The Potential Clinical Utility of the Customized Large Language Model in Gastroenterology: A Pilot Study

Eun Jeong Gong 1, 2, 3
Chang Seok Bang 1, 2, 3
Jae Jun Lee 3, 4
Jonghyung Park 5
Eunsil Kim 5
Subeen Kim 5
Minjae Kimm 6
Seoung‐Ho Choi 7
Publication type: Journal Article
Publication date: 2024-12-24
Journal: Bioengineering
scimago Q3
SJR: 0.627
CiteScore: 4.0
Impact factor: 3.8
ISSN: 2306-5354
Abstract

Background: The large language model (LLM) has the potential to be applied to clinical practice. However, studies on this topic in the field of gastroenterology remain scarce. Aim: This study explores the potential clinical utility of two LLMs in the field of gastroenterology: a customized GPT model and the conventional GPT-4o, an advanced LLM capable of retrieval-augmented generation (RAG). Method: We established a customized GPT with the BM25 algorithm using OpenAI’s GPT-4o model, which allows it to produce responses in the context of specific documents, including textbooks of internal medicine (in English) and gastroenterology (in Korean). We also prepared access to the conventional ChatGPT-4o (accessed on 16 October 2024). The benchmark (written in Korean) consisted of 15 clinical questions developed by four clinical experts, representing typical questions for medical students. The two LLMs, a gastroenterology fellow, and an expert gastroenterologist were tested to assess their performance. Results: While the customized LLM correctly answered 8 out of 15 questions, the fellow answered 10 correctly. When the standardized Korean medical terms were replaced with English terminology, the LLM’s performance improved, answering two additional knowledge-based questions correctly and matching the fellow’s score. However, judgment-based questions remained a challenge for the model. Even with the implementation of ‘Chain of Thought’ prompt engineering, the customized GPT did not achieve improved reasoning. The conventional GPT-4o achieved the highest score among the AI models (14/15). Although both models performed slightly below the expert gastroenterologist’s level (15/15), they show promising potential for clinical applications (scores comparable with or higher than that of the gastroenterology fellow). Conclusions: LLMs could be utilized to assist with specialized tasks such as patient counseling. However, RAG capabilities, which enable real-time retrieval of external data not included in the training dataset, appear essential for managing complex, specialized content, and clinician oversight will remain crucial to ensure safe and effective use in clinical practice.
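The abstract describes a customized GPT that retrieves passages from reference textbooks with the BM25 algorithm and feeds them to GPT-4o as context (retrieval-augmented generation). The authors' implementation is not published; the snippet below is a minimal sketch of that kind of pipeline, assuming the rank_bm25 package and the OpenAI Python client, with the corpus contents, model name, and prompt wording as illustrative placeholders.

```python
# Minimal RAG sketch: BM25 retrieval over reference documents, then GPT-4o answers
# using the retrieved passages as context. Corpus text, model name, and prompts are
# assumptions; the study's actual implementation details are not disclosed.
from rank_bm25 import BM25Okapi
from openai import OpenAI

# Hypothetical corpus: paragraphs extracted from the textbooks used in the study.
corpus = [
    "Paragraph 1 of the internal medicine textbook ...",
    "Paragraph 2 of the gastroenterology textbook ...",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

def answer(question: str, top_k: int = 3) -> str:
    # Rank corpus passages by BM25 relevance to the question.
    tokenized_query = question.lower().split()
    context_passages = bm25.get_top_n(tokenized_query, corpus, n=top_k)
    context = "\n\n".join(context_passages)

    # Ask GPT-4o to answer using only the retrieved context.
    client = OpenAI()  # requires OPENAI_API_KEY in the environment
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Answer the clinical question using only the context below.\n\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(answer("What is the first-line eradication regimen for H. pylori?"))
```

A 'Chain of Thought' variant of this setup would simply add a step-by-step reasoning instruction to the system prompt before the retrieved context.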

Zheng N.S., Keloth V.K., You K., Kats D., Li D.K., Deshpande O., Sachar H., Xu H., Laine L., Shung D.L.
Gastroenterology scimago Q1 wos Q1
2025-01-01 citations by CoLab: 1 Abstract  
Early identification and accurate characterization of overt gastrointestinal bleeding (GIB) enables opportunities to optimize patient management and ensures appropriately risk-adjusted coding for claims-based quality measures and reimbursement. Recent advancements in generative artificial intelligence, particularly large language models (LLMs), create opportunities to support accurate identification of clinical conditions. In this study, we present the first LLM-based pipeline for identification of overt GIB in the electronic health record (EHR). We demonstrate two clinically relevant applications: the automated detection of recurrent bleeding and appropriate reimbursement coding for patients with GIB.
Toiv A., Saleh Z., Ishak A., Alsheik E., Venkat D., Nandi N., Zuchelli T.E.
2024-08-30 citations by CoLab: 1 Abstract  
INTRODUCTION: The advent of artificial intelligence–powered large language models capable of generating interactive responses to intricate queries marks a groundbreaking development in how patients access medical information. Our aim was to evaluate the appropriateness and readability of gastroenterological information generated by Chat Generative Pretrained Transformer (ChatGPT). METHODS: We analyzed responses generated by ChatGPT to 16 dialog-based queries assessing symptoms and treatments for gastrointestinal conditions and 13 definition-based queries on prevalent topics in gastroenterology. Three board-certified gastroenterologists evaluated output appropriateness with a 5-point Likert-scale proxy measurement of currency, relevance, accuracy, comprehensiveness, clarity, and urgency/next steps. Outputs with a score of 4 or 5 in all 6 categories were designated as “appropriate.” Output readability was assessed with the Flesch Reading Ease score, Flesch-Kincaid Reading Level, and Simple Measure of Gobbledygook scores. RESULTS: ChatGPT responses to 44% of the 16 dialog-based and 69% of the 13 definition-based questions were deemed appropriate, and the proportion of appropriate responses within the 2 groups of questions was not significantly different (P = 0.17). Notably, none of ChatGPT’s responses to questions related to gastrointestinal emergencies were designated appropriate. The mean readability scores showed that outputs were written at a college-level reading proficiency. DISCUSSION: ChatGPT can produce generally fitting responses to gastroenterological medical queries, but responses were constrained in appropriateness and readability, which limits the current utility of this large language model. Substantial development is essential before these models can be unequivocally endorsed as reliable sources of medical information.
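The readability metrics cited in this study (Flesch Reading Ease, Flesch-Kincaid grade level, SMOG) are standard formulas and can be reproduced with the textstat package. The snippet below is only an illustrative sketch, not the authors' code, and the sample response text is invented.

```python
# Illustrative readability scoring of a chatbot response (not the study's code).
# Flesch Reading Ease = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words).
import textstat

response_text = (
    "Gastroesophageal reflux disease occurs when stomach acid repeatedly flows "
    "back into the esophagus, irritating its lining."
)

print("Flesch Reading Ease:  ", textstat.flesch_reading_ease(response_text))
print("Flesch-Kincaid Grade: ", textstat.flesch_kincaid_grade(response_text))
# Note: SMOG is designed for texts of 30+ sentences; short samples give rough values.
print("SMOG Index:           ", textstat.smog_index(response_text))
```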
Gong E.J., Bang C.S.
2024-08-07 citations by CoLab: 4 Abstract  
This letter evaluates the article by Gravina et al on ChatGPT’s potential in providing medical information for inflammatory bowel disease patients. While promising, it highlights the need for advanced techniques like reasoning + action and retrieval-augmented generation to improve accuracy and reliability. Emphasizing that simple question and answer testing is insufficient, it calls for more nuanced evaluation methods to truly gauge large language models’ capabilities in clinical applications.
Gravina A.G., Pellegrino R., Palladino G., Imperio G., Ventura A., Federico A.
Digestive and Liver Disease scimago Q2 wos Q1
2024-08-01 citations by CoLab: 20 Abstract  
Conversational chatbots, fueled by large language models, spark debate over their potential in education and medical career exams. There is debate in the literature about the scientific integrity of the outputs produced by these chatbots. This study evaluates ChatGPT 3.5 and Perplexity AI's cross-sectional performance in responding to questions from the 2023 Italian national residency admission exam (SSM23), comparing results and the chatbots' concordance with previous years' SSMs. Gastroenterology-related SSM23 questions were input into ChatGPT 3.5 and Perplexity AI, evaluating their performance in correct responses and total scores. This process was repeated with questions from the three preceding years. Additionally, chatbot concordance was assessed using Cohen's method. In SSM23, ChatGPT 3.5 outperforms Perplexity AI with 94.11% correct responses, demonstrating consistency across years. Concordance weakened in 2023 (κ=0.203, P = 0.148), but ChatGPT consistently maintains a high standard compared to Perplexity AI. ChatGPT 3.5 and Perplexity AI exhibit promise in addressing gastroenterological queries, emphasizing potential educational roles. However, their variable performance mandates cautious use as supplementary tools alongside conventional study methods. Clear guidelines are crucial for educators to balance traditional approaches and innovative systems, enhancing educational standards.
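Concordance between the two chatbots in this study was measured with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A minimal example of the calculation, using invented multiple-choice answers and scikit-learn, is sketched below.

```python
# Hypothetical concordance check between two chatbots' multiple-choice answers.
from sklearn.metrics import cohen_kappa_score

chatgpt_answers    = ["A", "C", "B", "D", "A", "B", "C", "A"]
perplexity_answers = ["A", "C", "D", "D", "B", "B", "C", "C"]

kappa = cohen_kappa_score(chatgpt_answers, perplexity_answers)
print(f"Cohen's kappa = {kappa:.2f}")  # about 0.51 for these invented labels
```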
Xie S., Zhao W., Deng G., He G., He N., Lu Z., Hu W., Zhao M., Du J.
2024-05-17 citations by CoLab: 6 Abstract  
Abstract Objective Synthesizing and evaluating inconsistent medical evidence is essential in evidence-based medicine. This study aimed to employ ChatGPT as a sophisticated scientific reasoning engine to identify conflicting clinical evidence and summarize unresolved questions to inform further research. Materials and Methods We evaluated ChatGPT’s effectiveness in identifying conflicting evidence and investigated its principles of logical reasoning. An automated framework was developed to generate a PubMed dataset focused on controversial clinical topics. ChatGPT analyzed this dataset to identify consensus and controversy, and to formulate unsolved research questions. Expert evaluations were conducted 1) on the consensus and controversy for factual consistency, comprehensiveness, and potential harm, and 2) on the research questions for relevance, innovation, clarity, and specificity. Results The gpt-4-1106-preview model achieved a 90% recall rate in detecting inconsistent claim pairs within a ternary assertions setup. Notably, without explicit reasoning prompts, ChatGPT provided sound reasoning for the assertions between claims and hypotheses, based on an analysis grounded in relevance, specificity, and certainty. ChatGPT’s conclusions of consensus and controversies in clinical literature were comprehensive and factually consistent. The research questions proposed by ChatGPT received high expert ratings. Discussion Our experiment implies that, in evaluating the relationship between evidence and claims, ChatGPT considered more detailed information beyond a straightforward assessment of sentimental orientation. This ability to process intricate information and conduct scientific reasoning regarding sentiment is noteworthy, particularly as this pattern emerged without explicit guidance or directives in prompts, highlighting ChatGPT’s inherent logical reasoning capabilities. Conclusion This study demonstrated ChatGPT’s capacity to evaluate and interpret scientific claims. Such proficiency can be generalized to broader clinical research literature. ChatGPT effectively aids in facilitating clinical studies by proposing unresolved challenges based on analysis of existing studies. However, caution is advised as ChatGPT’s outputs are inferences drawn from the input literature and could be harmful to clinical practice.
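The 90% recall figure reported here is the share of truly inconsistent claim pairs that the model flagged. A toy calculation under assumed gold-standard labels (not the study's data) is shown below.

```python
# Toy recall calculation for inconsistent-claim-pair detection (labels invented).
from sklearn.metrics import recall_score

# 1 = pair is inconsistent (gold standard), 0 = consistent.
gold      = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]  # model's flags

recall = recall_score(gold, predicted)  # TP / (TP + FN)
print(f"Recall = {recall:.2f}")  # 4 of 5 inconsistent pairs detected -> 0.80
```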
Khamaysi I., Klein A., Gorelik Y., Ghersin I., Arraf T., Ben-Ishay O.
2024-03-18 citations by CoLab: 8 PDF Abstract  
Abstract Background and study aims Rising prevalence of pancreatic cysts and inconsistent management guidelines necessitate innovative approaches. New features of large language models (LLMs), namely the custom GPT creation provided by ChatGPT, can be utilized to integrate multiple guidelines and settle inconsistencies. Methods A custom GPT was developed to provide guideline-based management advice for pancreatic cysts. Sixty clinical scenarios were evaluated by both the custom GPT and gastroenterology experts. A consensus was reached between the experts and a review of the guidelines, and the accuracy of the recommendations provided by the custom GPT was evaluated and compared with that of the experts. Results The custom GPT aligned with expert recommendations in 87% of scenarios. Initial expert recommendations were correct in 97% and 87% of cases, respectively. No significant difference was observed between the accuracy of the custom GPT and the experts. Agreement analysis using Cohen's and Fleiss' Kappa coefficients indicated consistency among the experts and the custom GPT. Conclusions This proof-of-concept study shows the custom GPT's potential to provide accurate, guideline-based recommendations for pancreatic cyst management, comparable to expert opinions. The study highlights the role of advanced features of LLMs in enhancing clinical decision-making in fields with significant practice variability.
Atarere J., Naqvi H., Haas C., Adewunmi C., Bandaru S., Allamneni R., Ugonabo O., Egbo O., Umoren M., Kanth P.
Digestive Diseases and Sciences scimago Q1 wos Q2
2024-01-24 citations by CoLab: 13 Abstract  
Over the past year, studies have shown potential in the applicability of ChatGPT in various medical specialties, including cardiology and oncology. However, the application of ChatGPT and other online chat-based AI models to patient education and patient-physician communication on colorectal cancer screening has not been critically evaluated, which is what we aimed to do in this study. We posed 15 questions on important colorectal cancer screening concepts and 5 common questions asked by patients to the 3 most commonly used freely available artificial intelligence (AI) models. The responses provided by the AI models were graded for appropriateness and reliability using American College of Gastroenterology guidelines. The responses to each question provided by an AI model were graded as reliably appropriate (RA), reliably inappropriate (RI), or unreliable. Grader assessments were validated by the joint probability of agreement for two raters. ChatGPT and YouChat™ provided RA responses to the questions posed more often than BingChat. There were two questions to which more than one AI model provided unreliable responses. ChatGPT did not provide references. BingChat misinterpreted some of the information it referenced. The age for CRC screening provided by YouChat™ was not consistently up-to-date. Inter-rater reliability for the 2 raters was 89.2%. Most responses provided by AI models on CRC screening were appropriate. Some limitations exist in their ability to correctly interpret medical literature and provide updated information in answering queries. Patients should consult their physicians for context on the recommendations made by these AI models.
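Grader agreement here is reported as the joint probability of agreement, i.e., the share of items on which both raters gave the same grade; unlike kappa, it does not correct for chance agreement. A minimal illustration with invented grades:

```python
# Joint probability of agreement between two raters (grades are invented).
rater1 = ["RA", "RA", "RI", "RA", "Unreliable", "RA", "RA", "RI"]
rater2 = ["RA", "RA", "RI", "RI", "Unreliable", "RA", "RA", "RA"]

agreement = sum(a == b for a, b in zip(rater1, rater2)) / len(rater1)
print(f"Joint probability of agreement = {agreement:.1%}")  # 6/8 = 75.0%
```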
Rammohan R., Joy M.V., Magam S.G., Natt D., Magam S.R., Pannikodu L., Desai J., Akande O., Bunting S., Yost R.M., Mustacchia P.
Cureus wos Q3
2024-01-08 citations by CoLab: 6
Tariq R., Malik S., Khanna S.
Gastroenterology scimago Q1 wos Q1
2024-01-01 citations by CoLab: 21 Abstract  
The popularity and implementation of artificial intelligence (AI)-enabled large language models (LLMs) powering chatbots are promising in healthcare, especially for patient queries and communication [1]. Lee et al [2] evaluated the performance of ChatGPT (version January 2023) for answering eight common patient questions related to colonoscopy and compared it to the responses available on hospital webpages. The study concluded that the performance of ChatGPT answers was similar to non-AI answers in ease of understanding and scientific adequacy. However, it must be noted that GPT-3.5 was used for evaluation. Since then, major updates have been released, specifically ChatGPT-4.0 by OpenAI and updated Bard by Google in March 2023 and July 2023 [3,4], respectively. Due to the rapid evolution of LLMs, we compared the performance of three LLMs, including ChatGPT-3.5, ChatGPT-4, and Bard (version July 2023, queries run on July 17th, 2023), to answer common patient inquiries related to colonoscopy. All three LLMs were given 47 questions twice and responses were recorded for both answers to assess for consistency (Supplementary Table). Responses were scored by two reviewers on a scale of 0-2 (completely correct = 2, correct but incomplete = 1, incorrect = 0). Disagreement among scores between the two reviewers (2 fellows) was resolved by a third gastroenterologist and a single final rating was included in the results. Answers were considered "unreliable" if the two responses for the same query were inconsistent and had differences in meaning. Each response was evaluated in two simulated scenarios: firstly, as replies on a patient-oriented information platform, like informational websites run by hospitals; and secondly, as AI-crafted preliminary responses to electronic message queries sent by patients, intended for healthcare provider review. Among the three models, the performance of ChatGPT-4 was considered highest, with 43 out of 47 (91.4%) responses graded as completely correct, while the rest (8.6%) of the responses by ChatGPT-4 were graded as correct but incomplete. None of the responses were graded as incorrect for ChatGPT-4. For ChatGPT-3.5, of the 47 questions, only three (6.4%) were graded as completely correct, 40 (85.1%) were graded as correct but incomplete, and 4 (8.5%) were considered incorrect. Similar results were seen with Google's Bard LLM. Only seven (14.9%) responses were graded as completely correct, 30 (63.8%) were graded as correct but incomplete, and 10 (21.3%) were considered incorrect. None of the responses were considered unreliable for ChatGPT-4 and ChatGPT-3.5; however, two responses generated by Bard were considered unreliable. Although our study did not compare the performance of AI chatbots with human responses, our results highlight that various LLMs have differences in their performance as judged by expert gastroenterologists. As the landscape of LLMs has been evolving at an unprecedented pace, studies that utilize ChatGPT-3.5 may not be reliable and accurately represent the full potential and nuances of current LLM capabilities. The improvements in ChatGPT-4, although still with limitations, offer enhanced performance metrics and better understanding of context with a broader knowledge base, and this model should be used in future studies. As LLMs continue to progress and become more prominent in medical use, there is a significant need for standardized regulation of performance metrics to ensure consistency, comparability, and reliability across different models [5]. For these models to be both clinically relevant and robust, it is vital that the medical research community be involved in their development, understand their metrics, and stay current with these rapid advancements.
Ali H., Patel P., Obaitan I., Mohan B.P., Sohail A.H., Smith-Martinez L., Lambert K., Gangwani M.K., Easler J.J., Adler D.G.
iGIE
2023-12-01 citations by CoLab: 10
Samaan J.S., Issokson K., Feldman E., Fasulo C., Ng W.H., Rajeev N., Hollander B., Yeo Y.H., Vasiliauskas E.
2023-10-30 citations by CoLab: 4 Abstract  
ABSTRACT Background and Aims Generative Pre-trained Transformer-4 (GPT-4) is a large language model (LLM) trained on a variety of topics, including the medical literature. Nutrition plays a critical role in managing inflammatory bowel disease (IBD), with an unmet need for nutrition-related patient education resources. The aim of this study is to examine the accuracy and reproducibility of responses by GPT-4 to patient nutrition questions related to IBD. Methods Questions were curated from adult IBD clinic visits, Facebook, and Reddit. Two IBD-focused registered dieticians independently graded the accuracy and reproducibility of GPT-4’s responses while a third senior IBD-focused registered dietitian arbitrated. To ascertain reproducibility, each question was inputted twice into the model. Descriptive analysis is presented as counts and proportions. Results In total, 88 questions were included. The model provided correct responses to 73/88 questions (83.0%), with 61 (69.0%) graded as comprehensive. A total of 15/88 (17%) responses were graded as mixed with correct and incorrect/outdated data. When examined by category, the model provided comprehensive responses to 10 (62.5%) questions related to “Nutrition and diet needs for surgery”, 12 (92.3%) “Tube feeding and parenteral nutrition”, 11 (64.7%) “General diet questions”, 10 (50%) “Diet for reducing symptoms/inflammation” and 18 (81.8%) “Micronutrients/supplementation needs”. The model provided reproducible responses to 81/88 (92.0%) questions. Conclusion GPT-4 provided comprehensive responses to the majority of questions, demonstrating the promising potential of LLMs as supplementary tools for IBD patients seeking nutrition-related information. However, 17% of responses contained incorrect information, highlighting the need for continuous refinement and validation of LLMs prior to incorporation into clinical practice. Future studies should focus on leveraging LLMs to enhance patient outcomes. Furthermore, efforts promoting patient and healthcare professional proficiency in using LLMs are essential to maximize their efficacy and facilitate personalized care.
Kim H., Gong E., Bang C.
Biomimetics scimago Q2 wos Q3 Open Access
2023-10-28 citations by CoLab: 9 PDF Abstract  
The era of big data has led to the necessity of artificial intelligence models to effectively handle the vast amount of clinical data available. These data have become indispensable resources for machine learning. Among the artificial intelligence models, deep learning has gained prominence and is widely used for analyzing unstructured data. Despite the recent advancement in deep learning, traditional machine learning models still hold significant potential for enhancing healthcare efficiency, especially for structured data. In the field of medicine, machine learning models have been applied to predict diagnoses and prognoses for various diseases. However, the adoption of machine learning models in gastroenterology has been relatively limited compared to traditional statistical models or deep learning approaches. This narrative review provides an overview of the current status of machine learning adoption in gastroenterology and discusses future directions. Additionally, it briefly summarizes recent advances in large language models.
Lim D.Y., Tan Y.B., Koh J.T., Tung J.Y., Sng G.G., Tan D.M., Tan C.
2023-10-19 citations by CoLab: 27 Abstract  
Abstract Background and Aim Colonoscopy is commonly used in screening and surveillance for colorectal cancer. Multiple different guidelines provide recommendations on the interval between colonoscopies. This can be challenging for non‐specialist healthcare providers to navigate. Large language models like ChatGPT are a potential tool for parsing patient histories and providing advice. However, the standard GPT model is not designed for medical use and can hallucinate. One way to overcome these challenges is to provide contextual information with medical guidelines to help the model respond accurately to queries. Our study compares the standard GPT4 against a contextualized model provided with relevant screening guidelines. We evaluated whether the models could provide correct advice for screening and surveillance intervals for colonoscopy. Methods Relevant guidelines pertaining to colorectal cancer screening and surveillance were formulated into a knowledge base for GPT. We tested 62 example case scenarios (three times each) on standard GPT4 and on a contextualized model with the knowledge base. Results The contextualized GPT4 model outperformed the standard GPT4 in all domains. No high‐risk features were missed, and only two cases had hallucination of additional high‐risk features. A correct interval to colonoscopy was provided in the majority of cases. Guidelines were appropriately cited in almost all cases. Conclusions A contextualized GPT4 model could identify high‐risk features and quote appropriate guidelines without significant hallucination. It gave a correct interval to the next colonoscopy in the majority of cases. This provides proof of concept that ChatGPT with appropriate refinement can serve as an accurate physician assistant.
Gorelik Y., Ghersin I., Maza I., Klein A.
Gastrointestinal Endoscopy scimago Q1 wos Q1
2023-10-01 citations by CoLab: 29 Abstract  
Background and Aims ChatGPT, an advanced language model, is increasingly utilized in diverse fields, including medicine. This study explores using ChatGPT to optimize post-colonoscopy management by providing guideline-based recommendations, addressing low adherence rates and timing issues. Methods In this proof-of-concept study twenty clinical scenarios were prepared as structured reports and free text notes, and ChatGPT's responses were evaluated by two senior gastroenterologists. Adherence to guidelines and accuracy were assessed, and inter-rater agreement was calculated using Fleiss' kappa coefficient. Results ChatGPT exhibited 90% adherence to guidelines and 85% accuracy, with a very good inter-rater agreement (Fleiss' kappa coefficient of 0.84, p
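Inter-rater agreement in this study was summarized with Fleiss' kappa, which generalizes Cohen's kappa to more than two raters. A small sketch using statsmodels with invented adherence ratings (not the study's data):

```python
# Fleiss' kappa for multiple raters (ratings are invented for illustration).
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = clinical scenarios, columns = raters;
# values are categorical grades (0 = non-adherent to guidelines, 1 = adherent).
ratings = np.array([
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
])

table, _ = aggregate_raters(ratings)  # per-scenario counts for each category
print(f"Fleiss' kappa = {fleiss_kappa(table):.2f}")  # about 0.45 for these invented ratings
```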
