Evolving Landscape of Large Language Models: An Evaluation of ChatGPT and Bard in Answering Patient Queries on Colonoscopy
Department of Internal Medicine, Rochester General Hospital, NY
Publication type: Journal Article
Publication date: 2024-01-01
Scimago quartile: Q1
Web of Science quartile: Q1
SJR: 7.195
CiteScore: 39.5
Impact factor: 25.1
ISSN: 0016-5085, 1528-0012
PubMed ID: 37634736
Gastroenterology
Hepatology
Abstract
The popularity and implementation of artificial intelligence (AI)-enabled large language models (LLMs) powering chatbots are promising in healthcare, especially for patient queries and communication.1 Lee et al.2 evaluated the performance of ChatGPT (version January 2023) in answering eight common patient questions related to colonoscopy and compared it with the responses available on hospital webpages. That study concluded that ChatGPT's answers were similar to non-AI answers in ease of understanding and scientific adequacy. However, it must be noted that GPT-3.5 was used for that evaluation. Since then, major updates have been released, specifically ChatGPT-4 by OpenAI and an updated Bard by Google, in March 2023 and July 2023, respectively.3,4

Given the rapid evolution of LLMs, we compared the performance of three LLMs, ChatGPT-3.5, ChatGPT-4, and Bard (version July 2023; queries run on July 17, 2023), in answering common patient inquiries related to colonoscopy. Each of the three LLMs was given 47 questions twice, and both responses were recorded to assess consistency (Supplementary Table). Responses were scored by two reviewers (both fellows) on a scale of 0-2 (completely correct = 2, correct but incomplete = 1, incorrect = 0). Disagreements between the two reviewers were resolved by a third gastroenterologist, and a single final rating was included in the results. Answers were considered "unreliable" if the two responses to the same query were inconsistent and differed in meaning. Each response was evaluated in two simulated scenarios: first, as a reply on a patient-oriented information platform, such as an informational website run by a hospital; and second, as an AI-drafted preliminary response to an electronic message query sent by a patient, intended for healthcare provider review.

Among the three models, ChatGPT-4 performed best: 43 of 47 responses (91.5%) were graded as completely correct, and the remaining 4 (8.5%) were graded as correct but incomplete. None of ChatGPT-4's responses were graded as incorrect. For ChatGPT-3.5, only 3 of the 47 responses (6.4%) were graded as completely correct, 40 (85.1%) were graded as correct but incomplete, and 4 (8.5%) were considered incorrect. Similar results were seen with Google's Bard: only 7 responses (14.9%) were graded as completely correct, 30 (63.8%) were graded as correct but incomplete, and 10 (21.3%) were considered incorrect. None of the responses from ChatGPT-4 or ChatGPT-3.5 were considered unreliable; however, two responses generated by Bard were.

Although our study did not compare the performance of AI chatbots with human responses, our results highlight that LLMs differ in their performance as judged by expert gastroenterologists. Because the landscape of LLMs is evolving at an unprecedented pace, studies that use ChatGPT-3.5 may not reliably or accurately represent the full potential and nuances of current LLM capabilities. ChatGPT-4, although it still has limitations, offers enhanced performance and a better understanding of context with a broader knowledge base, and should be used in future studies. As LLMs continue to progress and become more prominent in medical use, there is a significant need for standardized regulation of performance metrics to ensure consistency, comparability, and reliability across different models.5 For these models to be both clinically relevant and robust, it is vital that the medical research community be involved in their development, understand their metrics, and stay current with these rapid advancements.
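The grading and consistency rules described above amount to a small adjudication pipeline. The following Python sketch formalizes them for illustration only: the QuestionResult record, its field names, and the summarize helper are hypothetical (the letter's review was performed by human reviewers, not software), but the logic mirrors the stated protocol of two graders, a third-reviewer tie-break, and an "unreliable" flag for inconsistent repeat responses.

from dataclasses import dataclass
from typing import Dict, List, Optional

# Grading scale from the letter:
# 2 = completely correct, 1 = correct but incomplete, 0 = incorrect.

@dataclass
class QuestionResult:           # hypothetical record type, for illustration
    model: str                  # e.g., "ChatGPT-4"
    question: str
    reviewer1: int              # first fellow's grade (0-2)
    reviewer2: int              # second fellow's grade (0-2)
    adjudicated: Optional[int]  # third gastroenterologist's grade, if reviewers disagreed
    consistent: bool            # True if the two runs of the query agreed in meaning

def final_grade(r: QuestionResult) -> int:
    # Matching reviewer grades stand; disagreements fall to the adjudicator.
    if r.reviewer1 == r.reviewer2:
        return r.reviewer1
    if r.adjudicated is None:
        raise ValueError("reviewer disagreement requires a third review")
    return r.adjudicated

def summarize(results: List[QuestionResult]) -> Dict[str, Dict]:
    # Per-model counts of each grade, plus "unreliable" responses
    # (queries whose two runs differed in meaning).
    summary: Dict[str, Dict] = {}
    for r in results:
        s = summary.setdefault(r.model, {0: 0, 1: 0, 2: 0, "unreliable": 0})
        s[final_grade(r)] += 1
        if not r.consistent:
            s["unreliable"] += 1
    return summary

With 47 questions per model, the reported proportions follow directly from such counts, for example 43/47 ≈ 91.5% completely correct for ChatGPT-4.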
Top-30 Journals
- Liver International: 2 publications, 6.9%
- Clinical Gastroenterology and Hepatology: 2 publications, 6.9%
- Journal of Medical Internet Research: 2 publications, 6.9%
- JMIR Medical Informatics: 2 publications, 6.9%
- Gastro Hep Advances: 1 publication, 3.45%
- Schweizer Gastroenterologie: 1 publication, 3.45%
- Current Gastroenterology Reports: 1 publication, 3.45%
- Mayo Clinic Proceedings Digital Health: 1 publication, 3.45%
- Radiology: 1 publication, 3.45%
- Gastroenterology: 1 publication, 3.45%
- npj Digital Medicine: 1 publication, 3.45%
- Alimentary Pharmacology and Therapeutics: 1 publication, 3.45%
- International Journal of Colorectal Disease: 1 publication, 3.45%
- Laryngoscope: 1 publication, 3.45%
- Cancers: 1 publication, 3.45%
- Big Data and Cognitive Computing: 1 publication, 3.45%
- Bioengineering: 1 publication, 3.45%
- Healthcare: 1 publication, 3.45%
- Pharmacoepidemiology and Drug Safety: 1 publication, 3.45%
- International Journal of Medical Informatics: 1 publication, 3.45%
- Applied Sciences (Switzerland): 1 publication, 3.45%
- Endoscopy International Open: 1 publication, 3.45%
- Cureus: 1 publication, 3.45%
- JMIR Medical Education: 1 publication, 3.45%
Publishers
- Elsevier: 6 publications, 20.69%
- Springer Nature: 5 publications, 17.24%
- Wiley: 5 publications, 17.24%
- MDPI: 5 publications, 17.24%
- JMIR Publications: 5 publications, 17.24%
- Radiological Society of North America (RSNA): 1 publication, 3.45%
- Institute of Electrical and Electronics Engineers (IEEE): 1 publication, 3.45%
- Georg Thieme Verlag KG: 1 publication, 3.45%
- We do not take into account publications without a DOI.
- Statistics recalculated weekly.
Metrics
Total citations: 29
Citations from 2024: 27 (93.1%)
Cite this
GOST
Tariq R., Malik S., Khanna S. Evolving Landscape of Large Language Models: An Evaluation of ChatGPT and Bard in Answering Patient Queries on Colonoscopy // Gastroenterology. 2024. Vol. 166. No. 1. pp. 220-221.
RIS
TY - JOUR
DO - 10.1053/j.gastro.2023.08.033
UR - https://doi.org/10.1053/j.gastro.2023.08.033
TI - Evolving Landscape of Large Language Models: An Evaluation of ChatGPT and Bard in Answering Patient Queries on Colonoscopy
T2 - Gastroenterology
AU - Tariq, Raseen
AU - Malik, Sheza
AU - Khanna, Sahil
PY - 2024
DA - 2024/01/01
PB - Elsevier
SP - 220
EP - 221
IS - 1
VL - 166
PMID - 37634736
SN - 0016-5085
SN - 1528-0012
ER -
BibTeX
@article{2024_Tariq,
author = {Raseen Tariq and Sheza Malik and Sahil Khanna},
title = {Evolving Landscape of Large Language Models: An Evaluation of ChatGPT and Bard in Answering Patient Queries on Colonoscopy},
journal = {Gastroenterology},
year = {2024},
volume = {166},
publisher = {Elsevier},
month = {jan},
url = {https://doi.org/10.1053/j.gastro.2023.08.033},
number = {1},
pages = {220--221},
doi = {10.1053/j.gastro.2023.08.033}
}
MLA
Tariq, Raseen, et al. “Evolving Landscape of Large Language Models: An Evaluation of ChatGPT and Bard in Answering Patient Queries on Colonoscopy.” Gastroenterology, vol. 166, no. 1, Jan. 2024, pp. 220-221. https://doi.org/10.1053/j.gastro.2023.08.033.