Open Access

Journal of Medical Internet Research

Volume 26, page e53164

Hallucination Rates and Reference Accuracy of ChatGPT and Bard for Systematic Reviews: Comparative Analysis

Publication type: Journal Article

Publication date: 2024-05-22

JMIR Publications

Journal of Medical Internet Research

SCImago Q1

Top 10% SCImago

WOS Q1

БС1 (Russian White List, level 1)

SJR: 2.109

CiteScore: 10.4

Impact factor: 8.2

ISSN: 1439-4456 (Print), 1438-8871 (Electronic)

DOI: 10.2196/53164

PubMed ID: 38776130

Abstract

Background

Large language models (LLMs) have raised both interest and concern in the academic community. They offer the potential for automating literature search and synthesis for systematic reviews but raise concerns regarding their reliability, as the tendency to generate unsupported (hallucinated) content persists.

Objective

The aim of the study is to assess the performance of LLMs such as ChatGPT and Bard (subsequently rebranded Gemini) in producing references in the context of scientific writing.

Methods

The performance of ChatGPT and Bard in replicating the results of human-conducted systematic reviews was assessed. Using systematic reviews pertaining to shoulder rotator cuff pathology, these LLMs were tested by providing the same inclusion criteria and comparing the results with the original systematic review references, which served as the gold standard. The study used 3 key performance metrics: recall, precision, and F1-score, alongside the hallucination rate. Papers were considered “hallucinated” if any 2 of the following pieces of information were wrong: title, first author, or year of publication.
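The hallucination rule and the three metrics named above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the `Reference` type, the `_norm` text normalization, and the reading of "any 2" as "at least 2 of the 3 fields" are all hypothetical choices made for the example, and the sample data in the usage note is invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    """Hypothetical minimal record for one cited paper."""
    title: str
    first_author: str
    year: int


def _norm(text: str) -> str:
    # Assumption: compare text case-insensitively with collapsed whitespace.
    return " ".join(text.casefold().split())


def wrong_field_count(generated: Reference, actual: Reference) -> int:
    """Count how many of title, first author, and year disagree."""
    return sum([
        _norm(generated.title) != _norm(actual.title),
        _norm(generated.first_author) != _norm(actual.first_author),
        generated.year != actual.year,
    ])


def is_hallucinated(generated: Reference, actual: Reference) -> bool:
    # Assumption: "any 2 ... were wrong" is read as "at least 2 of 3".
    return wrong_field_count(generated, actual) >= 2


def precision(true_positives: int, retrieved: int) -> float:
    """Share of retrieved papers that are in the gold-standard list."""
    return true_positives / retrieved if retrieved else 0.0


def recall(true_positives: int, relevant: int) -> float:
    """Share of gold-standard papers that were retrieved."""
    return true_positives / relevant if relevant else 0.0


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0
```

For instance, with an invented gold-standard record `Reference("Example title A", "Author A", 2015)`, a generated reference that differs only in the year has one wrong field and is not flagged, whereas one with both a different title and a different year is flagged as hallucinated.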

Results

In total, 11 systematic reviews across 4 fields yielded 33 prompts to LLMs (3 LLMs×11 reviews), with 471 references analyzed. Precision rates for GPT-3.5, GPT-4, and Bard were 9.4% (13/139), 13.4% (16/119), and 0% (0/104) respectively (P<.001). Recall rates were 11.9% (13/109) for GPT-3.5 and 13.7% (15/109) for GPT-4, with Bard failing to retrieve any relevant papers (P<.001). Hallucination rates stood at 39.6% (55/139) for GPT-3.5, 28.6% (34/119) for GPT-4, and 91.4% (95/104) for Bard (P<.001). Further analysis of nonhallucinated papers retrieved by GPT models revealed significant differences in identifying various criteria, such as randomized studies, participant criteria, and intervention criteria. The study also noted the geographical and open-access biases in the papers retrieved by the LLMs.
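As a quick arithmetic check, the percentages above can be recomputed from the counts given in parentheses. The sketch below uses only numbers stated in the Results paragraph; the dictionary layout, the function names, and the 0.1-percentage-point tolerance are assumptions made for this illustration.

```python
# (numerator, denominator, reported percentage), copied from the Results text.
REPORTED = {
    "GPT-3.5 precision": (13, 139, 9.4),
    "GPT-4 precision": (16, 119, 13.4),
    "Bard precision": (0, 104, 0.0),
    "GPT-3.5 recall": (13, 109, 11.9),
    "GPT-4 recall": (15, 109, 13.7),
    "GPT-3.5 hallucination rate": (55, 139, 39.6),
    "GPT-4 hallucination rate": (34, 119, 28.6),
    "Bard hallucination rate": (95, 104, 91.4),
}


def percent(numerator: int, denominator: int) -> float:
    """Return numerator/denominator as a percentage."""
    return 100.0 * numerator / denominator


def deviations(reported: dict) -> dict:
    """Absolute gap, in percentage points, between recomputed and reported values."""
    return {
        label: abs(percent(n, d) - pct)
        for label, (n, d, pct) in reported.items()
    }


if __name__ == "__main__":
    for label, gap in deviations(REPORTED).items():
        n, d, pct = REPORTED[label]
        print(f"{label}: {percent(n, d):.2f}% recomputed vs {pct}% reported (gap {gap:.2f})")
```

Under this check, every recomputed value agrees with the reported percentage to within 0.1 percentage point; the small remaining gaps are presumably due to rounding.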

Conclusions

Given their current performance, it is not recommended for LLMs to be deployed as the primary or exclusive tool for conducting systematic reviews. Any references generated by such models warrant thorough validation by researchers. The high occurrence of hallucinations in LLMs highlights the necessity for refining their training and functionality before confidently using them for rigorous academic purposes.


Top 30

Journals

medRxiv: 7 publications, 2.73%
JMIR Formative Research: 4 publications, 1.56%
Frontiers in Education: 4 publications, 1.56%
Cureus: 3 publications, 1.17%
Endocrine: 3 publications, 1.17%
bioRxiv: 3 publications, 1.17%
Journal of Medical Systems: 3 publications, 1.17%
Journal of Clinical Medicine: 3 publications, 1.17%
MedEdPublish: 3 publications, 1.17%
Digital Health: 3 publications, 1.17%
Scientific Reports: 3 publications, 1.17%
npj Digital Medicine: 3 publications, 1.17%
JMIR Medical Informatics: 3 publications, 1.17%
AI: 2 publications, 0.78%
ACM Transactions on Software Engineering and Methodology: 2 publications, 0.78%
International Journal of Human-Computer Interaction: 2 publications, 0.78%
Journal of Educational Research: 2 publications, 0.78%
Applied Sciences (Switzerland): 2 publications, 0.78%
Diagnostics: 2 publications, 0.78%
Assessment and Evaluation in Higher Education: 2 publications, 0.78%
Health Information Science and Systems: 2 publications, 0.78%
Lecture Notes in Computer Science: 2 publications, 0.78%
International Journal of Impotence Research: 2 publications, 0.78%
Computers and Education Artificial Intelligence: 2 publications, 0.78%
PLoS ONE: 2 publications, 0.78%
Discover Public Health: 2 publications, 0.78%
BMC Medical Education: 2 publications, 0.78%
Knee Surgery, Sports Traumatology, Arthroscopy: 2 publications, 0.78%
Société Internationale d’Urologie Journal: 2 publications, 0.78%

Publishers

Springer Nature: 48 publications, 18.75%
Elsevier: 42 publications, 16.41%
MDPI: 28 publications, 10.94%
Taylor & Francis: 21 publications, 8.2%
JMIR Publications: 15 publications, 5.86%
Wiley: 14 publications, 5.47%
SAGE: 12 publications, 4.69%
openRxiv: 10 publications, 3.91%
Association for Computing Machinery (ACM): 10 publications, 3.91%
Frontiers Media S.A.: 10 publications, 3.91%
Institute of Electrical and Electronics Engineers (IEEE): 9 publications, 3.52%
Ovid Technologies (Wolters Kluwer Health): 9 publications, 3.52%
Oxford University Press: 3 publications, 1.17%
F1000 Research: 3 publications, 1.17%
BMJ: 2 publications, 0.78%
Georg Thieme Verlag KG: 2 publications, 0.78%
Public Library of Science (PLoS): 2 publications, 0.78%
De Gruyter Brill: 2 publications, 0.78%
Emerald: 1 publication, 0.39%
American Chemical Society (ACS): 1 publication, 0.39%
S. Karger AG: 1 publication, 0.39%
Ediciones Universidad de Salamanca: 1 publication, 0.39%
Bentham Science Publishers Ltd.: 1 publication, 0.39%
Prague University of Economics and Business: 1 publication, 0.39%
American Society of Civil Engineers (ASCE): 1 publication, 0.39%
John Benjamins Publishing Company: 1 publication, 0.39%
Centre for Evaluation in Education and Science (CEON/CEES): 1 publication, 0.39%
Human Kinetics: 1 publication, 0.39%
Association for Vascular Access: 1 publication, 0.39%
Izmir Akademi Dernegi: 1 publication, 0.39%

Publications without a DOI are not counted.
Publication statistics are updated weekly.

Metrics

256

Cite

GOST

Chelli M. et al. Hallucination Rates and Reference Accuracy of ChatGPT and Bard for Systematic Reviews: Comparative Analysis // Journal of Medical Internet Research. 2024. Vol. 26. p. e53164.

GOST with all authors (up to 50)

Chelli M., Descamps J., Lavoué V., Trojani C., Azar M., Deckert M., Raynier J., Clowez G., Boileau P., Ruetsch Chelli C. Hallucination Rates and Reference Accuracy of ChatGPT and Bard for Systematic Reviews: Comparative Analysis // Journal of Medical Internet Research. 2024. Vol. 26. p. e53164.

RIS

TY - JOUR

DO - 10.2196/53164

UR - https://www.jmir.org/2024/1/e53164

TI - Hallucination Rates and Reference Accuracy of ChatGPT and Bard for Systematic Reviews: Comparative Analysis

T2 - Journal of Medical Internet Research

AU - Chelli, Mikaël

AU - Descamps, Jules

AU - Lavoué, Vincent

AU - Trojani, Christophe

AU - Azar, Michel

AU - Deckert, Marcel

AU - Raynier, Jean-Luc

AU - Clowez, Gilles

AU - Boileau, Pascal

AU - Ruetsch Chelli, Caroline

PY - 2024

DA - 2024/05/22

PB - JMIR Publications

SP - e53164

VL - 26

PMID - 38776130

SN - 1439-4456

SN - 1438-8871

ER -

BibTeX (up to 50 authors)

@article{2024_Chelli,

author = {Mikaël Chelli and Jules Descamps and Vincent Lavoué and Christophe Trojani and Michel Azar and Marcel Deckert and Jean-Luc Raynier and Gilles Clowez and Pascal Boileau and Caroline Ruetsch Chelli},

title = {Hallucination Rates and Reference Accuracy of ChatGPT and Bard for Systematic Reviews: Comparative Analysis},

journal = {Journal of Medical Internet Research},

year = {2024},

volume = {26},

publisher = {JMIR Publications},

month = {may},

url = {https://www.jmir.org/2024/1/e53164},

pages = {e53164},

doi = {10.2196/53164}

}

